Unsupervised Real Estate Price Prediction - Final Report - Deliverable 1

DTSA 5510 Unsupervised Algorithms in Machine Learning - University of Colorado Boulder

Author

Andrew Simms

Published

December 31, 2023

1 Introduction

1.1 Project Topic

This project applies unsupervised and supervised machine learning (ML) algorithms to price prediction on real-world real estate listings downloaded from Zillow.com, with additional listing information scraped from Redfin.com. To start, Section 3 describes the source of the data and introduces the dataset. Section 4 uses common techniques to clean the dataset, removing outliers and sanitizing columns to build a dataset that can be input into varied ML algorithms. Section 5 explores the dataset and produces visualizations so the author and audience can gain an initial understanding of the data. This section also explores the interactions between features and begins to analyze collinearity.

In Section 6 we dive deep into both unsupervised and supervised ML algorithms and build models to cluster, then regress on, the data, aiming to create subsets of data that model price more accurately than one large dataset. Section 7 discusses the results of the modeling and aims to select the best path for price prediction. Finally, Section 8 details the outcomes of this project and discusses areas of further research. All files and code for this project can be found here: https://github.com/simmsa/dtsa_5510_final.

To narrow down the listings, simplify data collection, and make the result more relevant for the author, this document focuses on real estate listings in the Denver, Colorado metropolitan area. Because this project uses a relatively small amount of real-world data, the outcomes may not be definitive. Throughout this project we aim to highlight the shortcomings of this analysis and discuss methods for improving data size and data quality. As we build our models we aim for a good balance of accuracy and execution speed, and we want to be careful not to create a model that suits this use case but is too complex for a larger problem.

1.2 Project Goal

The goal of this project is to build a pricing model that predicts the price of new real estate listings, in order to seek out listings that may be undervalued. To achieve this goal we plan to use a divide-and-conquer ML strategy, which uses clustering and principal component techniques to reduce the data into smaller groups. These smaller groups will be passed into several supervised regression algorithms, and their accuracies will be compared to determine the best-performing combination of models.
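The divide-and-conquer strategy can be sketched as follows. This is a minimal illustration on synthetic stand-in data, not the project's dataset: cluster first with an unsupervised model, then fit one regressor per cluster.

```python
# Illustrative sketch of the cluster-then-regress strategy on synthetic
# data; the features and coefficients here are placeholders, not the
# real listing data.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))  # stand-in features
y = X @ np.array([3.0, -1.0, 2.0]) + rng.normal(scale=0.1, size=200)

# Step 1: split the data into smaller groups with an unsupervised model
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Step 2: fit one supervised regressor per cluster
models = {}
for k in np.unique(labels):
    mask = labels == k
    models[k] = LinearRegression().fit(X[mask], y[mask])

# Each cluster now has its own pricing model; compare per-cluster accuracy
for k, m in models.items():
    print(k, round(m.score(X[labels == k], y[labels == k]), 3))
```

In the full project, Linear Regression is only one of the candidate regressors; the same per-cluster loop applies to any scikit-learn estimator.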

For this dataset, our target variable is price, a floating-point value, so regression algorithms are a suitable choice. Because typical regression algorithms expect numeric features, we plan to convert as much of the data as possible to floating-point values. As this aspect is not the primary focus of this project, certain features may be omitted to streamline the construction and analysis of the models. Even so, we strive to keep the core values of the data, and our goal for cleaning and processing the data is to build a unified source of truth that will be used by all ML models.

1.3 Project Plan

In Figure 1 we outline the steps necessary to reach our goal of predicting the price of a real estate listing.

flowchart TD
    A("Data Cleaning / Wrangling")
    B("Imputation")
    C("Feature Selection")
    subgraph unsupervised ["Unsupervised ML"]
        subgraph clustering ["Clustering"]
            D("KMeans")
            E("AgglomerativeClustering")
        end
        subgraph dimred ["Dimensionality Reduction"]
            L("Principal Component Analysis")
        end
    end

    subgraph supervised ["Supervised Regression"]
        F("Linear Regression")
        G(AdaBoostRegressor)
        H(XGBRegressor)
    end

    J["Analysis"]

    K["Selection"]


    A --> B --> C

    C --> unsupervised --> supervised --> J --> K
Figure 1: Unsupervised Real Estate Price Prediction Flowchart

2 Python Setup

The following code section includes the Python libraries used to execute the code contained in this document. See Section 9 for complete details on the environment and library versions used.

import itertools
import copy
import time

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.decomposition import PCA
from sklearn.ensemble import AdaBoostRegressor
from sklearn.impute import KNNImputer
from sklearn.linear_model import LinearRegression, RidgeCV
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_percentage_error, silhouette_score
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.preprocessing import StandardScaler

import tensorflow as tf

import xgboost as xgb

sns.set_theme()

2.1 Visualization Functions

def brand_plot(has_subplots=False):
    """
    Add branding elements to matplotlib.plt objects with a title and copyright

    Parameters:
    - has_subplots (bool): Whether the plot has subplots. Default is False.

    Returns:
    None
    """

    if not has_subplots:
        plt.suptitle("Real Estate Listing Price Prediction")

    txt = "DTSA 5510 - 2023 Summer 1 - Andrew Simms"
    plt.figtext(
        0.95,
        -0.01,
        txt,
        wrap=True,
        horizontalalignment="right",
        fontsize=6,
        fontstyle="italic",
    )
    plt.tight_layout()

3 Data Information

3.1 Data Source

In the United States (US), real estate listing data is typically not readily available to the consumer. These listings are held by the Multiple Listing Service (MLS), which makes them available to realtors. Companies like Zillow and Redfin acquire these listings and provide services that make it easy for users to find real estate properties for sale. Ideally, we would source listing data directly from the MLS, but it does not offer a readily available API. We can, however, use data from Zillow and Redfin; we just have to find methods for scraping each respective site.

Data is acquired from Zillow using our custom-made scraper.py Python script. This script interfaces with the Zillow GetSearchPageState API and downloads current real estate listings by zip code for the included zip codes. A framework for this type of scraping can be found in this Zillow web scraping tutorial. The resulting JSON is then formatted with formatter.py and saved to a CSV file.

To add additional features, we use the addresses from Zillow to query Redfin via the Python-Redfin library. Querying Redfin adds information about school districts, home features, neighborhood, and listing condition that should provide additional features to improve the pricing model. This data is combined with the Zillow data and saved to a CSV file.
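The combining step can be sketched with a pandas join on the address. This is a hedged illustration: the frames below are tiny stand-ins, and only the column names ("address", "schools_rating") mirror the final dataset.

```python
# Sketch of combining the Zillow and Redfin tables on the address key;
# the data here is invented for illustration, not real scraped listings.
import pandas as pd

zillow = pd.DataFrame({
    "address": ["1 Main St", "2 Oak Ave"],
    "price": [500_000, 750_000],
})
redfin = pd.DataFrame({
    "address": ["1 Main St", "2 Oak Ave"],
    "schools_rating": [7.0, 9.0],
})

# A left join keeps every Zillow listing even when a Redfin query fails
combined = zillow.merge(redfin, on="address", how="left")
combined.to_csv("combined_listings.csv", index=False)
print(combined.shape)  # (2, 3)
```

A left join is the safe choice here because Zillow is the primary source; missing Redfin fields simply become NaN rather than dropping the listing.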

3.1.1 Data Acquisition Flowchart

flowchart LR
    subgraph daq ["Data Acquisition"]
        direction LR
        subgraph zillow ["Zillow.com"]
            A["scraper.py"]
            B["GetSearchPageState API"]
            C["Zillow Listings"]
            D["json"]
            E["formatter.py"]
            F["csv"]
        end

        subgraph redfin ["Redfin.com"]
            G["redfin_scraper.py"]
            H["csv"]
        end

        K["Final Output"]


        A --> B --> C --> D --> E --> F

        F -- Addresses --> G --> H --> K
    end
Figure 2: Data Acquisition Flowchart

The following code reads the saved csv dataset into a pandas DataFrame object. In Section 3.2 columns 0 to 87 originate from Zillow and columns 88 and above are from Redfin.

df = pd.read_csv("2023_04_22-17_57_52_10_mi_radius_unique_denver_area_w_redfin.csv")

3.2 Data Description

The initial data is shown using df.info() as executed below:

df.info(verbose=True, show_counts=True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 718 entries, 0 to 717
Data columns (total 161 columns):
 #    Column                                                   Non-Null Count  Dtype  
---   ------                                                   --------------  -----  
 0    Unnamed: 0                                               718 non-null    int64  
 1    zpid                                                     718 non-null    int64  
 2    price                                                    718 non-null    object 
 3    priceLabel                                               718 non-null    object 
 4    beds                                                     714 non-null    float64
 5    baths                                                    718 non-null    float64
 6    area                                                     715 non-null    float64
 7    statusType                                               718 non-null    object 
 8    statusText                                               718 non-null    object 
 9    isFavorite                                               718 non-null    bool   
 10   isUserClaimingOwner                                      718 non-null    bool   
 11   isUserConfirmedClaim                                     718 non-null    bool   
 12   imgSrc                                                   718 non-null    object 
 13   hasImage                                                 718 non-null    bool   
 14   visited                                                  718 non-null    bool   
 15   listingType                                              26 non-null     object 
 16   shouldShowZestimateAsPrice                               718 non-null    bool   
 17   detailUrl                                                718 non-null    object 
 18   pgapt                                                    718 non-null    object 
 19   sgapt                                                    718 non-null    object 
 20   has3DModel                                               718 non-null    bool   
 21   hasVideo                                                 718 non-null    bool   
 22   isHomeRec                                                718 non-null    bool   
 23   address                                                  718 non-null    object 
 24   info3String                                              708 non-null    object 
 25   info1String                                              708 non-null    object 
 26   brokerName                                               708 non-null    object 
 27   hasAdditionalAttributions                                718 non-null    bool   
 28   isFeaturedListing                                        718 non-null    bool   
 29   isShowcaseListing                                        718 non-null    bool   
 30   availabilityDate                                         0 non-null      float64
 31   timeOnZillow                                             718 non-null    int64  
 32   latLong_latitude                                         718 non-null    float64
 33   latLong_longitude                                        718 non-null    float64
 34   variableData_type                                        423 non-null    object 
 35   variableData_text                                        423 non-null    object 
 36   variableData_data_isFresh                                18 non-null     object 
 37   hdpData_homeInfo_zpid                                    718 non-null    int64  
 38   hdpData_homeInfo_zipcode                                 718 non-null    int64  
 39   hdpData_homeInfo_city                                    718 non-null    object 
 40   hdpData_homeInfo_state                                   718 non-null    object 
 41   hdpData_homeInfo_latitude                                718 non-null    float64
 42   hdpData_homeInfo_longitude                               718 non-null    float64
 43   hdpData_homeInfo_price                                   718 non-null    float64
 44   hdpData_homeInfo_datePriceChanged                        244 non-null    float64
 45   hdpData_homeInfo_bathrooms                               718 non-null    float64
 46   hdpData_homeInfo_bedrooms                                714 non-null    float64
 47   hdpData_homeInfo_livingArea                              715 non-null    float64
 48   hdpData_homeInfo_homeType                                718 non-null    object 
 49   hdpData_homeInfo_homeStatus                              718 non-null    object 
 50   hdpData_homeInfo_daysOnZillow                            718 non-null    int64  
 51   hdpData_homeInfo_isFeatured                              718 non-null    bool   
 52   hdpData_homeInfo_shouldHighlight                         718 non-null    bool   
 53   hdpData_homeInfo_zestimate                               608 non-null    float64
 54   hdpData_homeInfo_rentZestimate                           642 non-null    float64
 55   hdpData_homeInfo_listing_sub_type_is_FSBA                664 non-null    object 
 56   hdpData_homeInfo_priceReduction                          213 non-null    object 
 57   hdpData_homeInfo_isUnmappable                            718 non-null    bool   
 58   hdpData_homeInfo_isPreforeclosureAuction                 718 non-null    bool   
 59   hdpData_homeInfo_homeStatusForHDP                        718 non-null    object 
 60   hdpData_homeInfo_priceForHDP                             718 non-null    float64
 61   hdpData_homeInfo_priceChange                             244 non-null    float64
 62   hdpData_homeInfo_isNonOwnerOccupied                      718 non-null    bool   
 63   hdpData_homeInfo_isPremierBuilder                        718 non-null    bool   
 64   hdpData_homeInfo_isZillowOwned                           718 non-null    bool   
 65   hdpData_homeInfo_currency                                718 non-null    object 
 66   hdpData_homeInfo_country                                 718 non-null    object 
 67   hdpData_homeInfo_taxAssessedValue                        703 non-null    float64
 68   hdpData_homeInfo_lotAreaValue                            714 non-null    float64
 69   hdpData_homeInfo_lotAreaUnit                             714 non-null    object 
 70   hdpData_homeInfo_listing_sub_type_is_openHouse           106 non-null    object 
 71   hdpData_homeInfo_listing_sub_type_is_comingSoon          27 non-null     object 
 72   hdpData_homeInfo_openHouse                               106 non-null    object 
 73   hdpData_homeInfo_open_house_info_open_house_showing      106 non-null    object 
 74   variableData                                             0 non-null      float64
 75   hdpData_homeInfo_listing_sub_type_is_newHome             26 non-null     object 
 76   hdpData_homeInfo_newConstructionType                     26 non-null     object 
 77   hdpData_homeInfo_videoCount                              23 non-null     float64
 78   communityName                                            21 non-null     object 
 79   isPropertyResultCDP                                      3 non-null      object 
 80   style                                                    1 non-null      object 
 81   isCdpResult                                              3 non-null      object 
 82   unitCount                                                3 non-null      float64
 83   hdpData_homeInfo_group_type                              3 non-null      object 
 84   hdpData_homeInfo_priceSuffix                             3 non-null      object 
 85   hdpData_homeInfo_providerListingID                       21 non-null     float64
 86   hdpData_homeInfo_unit                                    7 non-null      object 
 87   hdpData_homeInfo_listing_sub_type_is_bankOwned           1 non-null      object 
 88   schools_rating                                           681 non-null    float64
 89   school_district                                          681 non-null    object 
 90   num_stories                                              510 non-null    float64
 91   year_built                                               639 non-null    float64
 92   year_renovated                                           557 non-null    float64
 93   sq_ft_finished                                           637 non-null    float64
 94   total_sq_ft                                              639 non-null    float64
 95   lot_sq_ft                                                673 non-null    float64
 96   taxable_land_value                                       671 non-null    float64
 97   taxable_improvement_value                                641 non-null    float64
 98   county_name                                              681 non-null    object 
 99   basement_information_BASEMENT_SQUARE_FEET                491 non-null    object 
 100  basement_information_BASEMENT_FINISH_CODE                307 non-null    object 
 101  basement_information_BASEMENT_TYPE_CODE                  353 non-null    object 
 102  bathroom_information_FULL_BATHS                          559 non-null    float64
 103  bathroom_information_HALF_BATHS                          210 non-null    float64
 104  heating_&_cooling_HEATING_TYPE_CODE                      625 non-null    object 
 105  room_information_TOTAL_ROOMS                             218 non-null    float64
 106  exterior_information_CONDITION_CODE                      226 non-null    object 
 107  exterior_information_STYLE_CODE                          333 non-null    object 
 108  exterior_information_CONSTRUCTION_TYPE_CODE              501 non-null    object 
 109  exterior_information_BUILDING_QUALITY_CODE               631 non-null    object 
 110  exterior_information_EXTERIOR_WALL_CODE                  220 non-null    object 
 111  exterior_information_ROOF_COVER_CODE                     87 non-null     object 
 112  exterior_information_ROOF_TYPE_CODE                      121 non-null    object 
 113  exterior_information_BUILDING_CODE                       545 non-null    object 
 114  property_information_SUBDIVISION_NAME                    638 non-null    object 
 115  property_information_LIVING_SQUARE_FEET                  635 non-null    object 
 116  property_information_GROUND_FLOOR_SQUARE_FEET            623 non-null    object 
 117  property_information_BUILDING_SQUARE_FEET                637 non-null    object 
 118  property_information_STORIES_NUMBER                      510 non-null    float64
 119  property_information_STORIES_CODE                        496 non-null    object 
 120  property_information_LEGAL_DESCRIPTION                   671 non-null    object 
 121  parking_&_garage_information_PARKING_TYPE                599 non-null    object 
 122  parking_&_garage_information_GARAGE_PARKING_SQUARE_FEET  574 non-null    object 
 123  parking_&_garage_information_GARAGE_CODE                 578 non-null    object 
 124  lot_information_NUMBER_OF_BUILDINGS                      638 non-null    float64
 125  lot_information_LAND_SQUARE_FOOTAGE                      671 non-null    object 
 126  lot_information_LEGAL_LOT_NUMBER                         629 non-null    object 
 127  lot_information_LEGAL_BLOCK_NUMBER                       436 non-null    object 
 128  lot_information_MUNICIPALITY_NAME                        197 non-null    object 
 129  lot_information_ACRES                                    671 non-null    float64
 130  lot_information_COUNTY_USE_DESCRIPTION                   231 non-null    object 
 131  assessor_information_ASSESSED_YEAR                       671 non-null    float64
 132  assessor_information_TAX_AREA                            179 non-null    object 
 133  heating_&_cooling_AIR_CONDITIONING_CODE                  219 non-null    object 
 134  fireplace_information_FIREPLACE_INDICATOR                378 non-null    object 
 135  property_information_ADJUSTED_GROSS_SQUARE_FEET          477 non-null    object 
 136  parking_&_garage_information_PARKING_SPACES              6 non-null      float64
 137  lot_information_ZONING_CODE                              282 non-null    object 
 138  lot_information_STATE_USE_DESCRIPTION                    445 non-null    object 
 139  bathroom_information_THREE_QUARTERS_BATHS                374 non-null    float64
 140  fireplace_information_NUMBER_OF_FIREPLACES               360 non-null    float64
 141  location_information_LOCATION_INFLUENCE_CODE             38 non-null     object 
 142  fireplace_information_FIREPLACE_TYPE_CODE                18 non-null     object 
 143  exterior_information_BUILDING_IMPROVEMENT_CODE           184 non-null    object 
 144  pool_information_POOL_CODE                               43 non-null     object 
 145  pool_information_POOL_INDICATOR                          47 non-null     object 
 146  lot_information_DEPTH_FOOTAGE                            25 non-null     float64
 147  lot_information_FRONT_FOOTAGE                            25 non-null     float64
 148  bathroom_information_BATH_FIXTURES                       86 non-null     float64
 149  heating_&_cooling_FUEL_CODE                              87 non-null     object 
 150  property_information_GROSS_SQUARE_FEET                   84 non-null     object 
 151  utility_information_UTILITIES_CODE                       87 non-null     object 
 152  property_information_SA_LGL_DSCRPTN                      2 non-null      object 
 153  property_information_SA_CONSTRUCTION_CODE                1 non-null      object 
 154  property_information_SA_HEATING_COOLING                  1 non-null      object 
 155  property_information_SA_FIREPLACE_CODE                   2 non-null      object 
 156  property_information_SA_GARAGE_CARPORT                   1 non-null      object 
 157  property_information_SA_NBR_UNITS                        1 non-null      float64
 158  exterior_information_FOUNDATION_CODE                     2 non-null      object 
 159  utility_information_SEWER_CODE                           3 non-null      object 
 160  utility_information_WATER_CODE                           3 non-null      object 
dtypes: bool(19), float64(46), int64(6), object(90)
memory usage: 810.0+ KB

This shows that the initial dataset has 718 rows and 161 columns, and is roughly 810 kB in size. Both of these values indicate that the dataset has sufficient information for further processing.

In this analysis we can see that many columns have large numbers of null values. Additionally, while many features are already float64, a number of numeric-looking columns are stored as object and will need to be converted to floating-point values.
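Auditing and converting those columns can be sketched as below. The frame is a small stand-in with the same kind of mixed types as the real dataset; the `$`/comma cleanup pattern is an assumption about how the raw strings look, not the project's exact cleaning code.

```python
# Minimal sketch: list object columns that may need numeric conversion,
# then coerce one with pd.to_numeric. The sample frame is invented.
import pandas as pd

sample = pd.DataFrame({
    "price": ["$500,000", "$750,000", "N/A"],  # object: needs cleaning
    "beds": [3.0, 4.0, 2.0],                   # already float64
})

# Which columns are stored as object?
object_cols = sample.select_dtypes(include="object").columns
print(list(object_cols))  # ['price']

# Strip currency symbols and thousands separators, then coerce;
# unparseable values become NaN instead of raising
sample["price"] = pd.to_numeric(
    sample["price"].str.replace(r"[$,]", "", regex=True), errors="coerce"
)
print(sample["price"].dtype)  # float64
```

Using `errors="coerce"` turns bad values into NaN, which the imputation step in the project plan can then handle alongside the genuinely missing values.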

3.3 Data Filtering Parameters

As we start to query the data we will save our notes on the dataset in filter_params. This dictionary captures values that we will use for data cleaning in Section 4. Note that we follow the Python convention of using all caps for names that should not be changed, applied here to the dictionary keys.

filter_params = {}

3.4 Initial Filtering/High Level Cleaning

Before we begin our analysis we must perform some initial data cleaning to remove obvious shortcomings in our listing data.

3.4.1 Filtering Columns

Our first step will be the removal of columns with large counts of null values. While some of these columns contain high-quality data, we need both high-quality and high-quantity data for our ML models. To filter, we count the number of null values in each column and drop columns whose null percentage is higher than NULL_MAX_PERCENT:

filter_params["NULL_MAX_PERCENT"] = 0.25

null_counts = df.isnull().sum()

columns_to_drop = null_counts[null_counts / len(df) > filter_params["NULL_MAX_PERCENT"]].index

df_before_columns_count = len(df.columns)
df = df.drop(columns_to_drop, axis=1)
df_after_columns_count = len(df.columns)

print(f"Before Filtering Column Count: {df_before_columns_count}")
print(f"After Filtering Column Count: {df_after_columns_count}")
print(f"Column Filtering Removal Count: {df_before_columns_count - df_after_columns_count}")
Before Filtering Column Count: 161
After Filtering Column Count: 89
Column Filtering Removal Count: 72
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 718 entries, 0 to 717
Data columns (total 89 columns):
 #   Column                                                   Non-Null Count  Dtype  
---  ------                                                   --------------  -----  
 0   Unnamed: 0                                               718 non-null    int64  
 1   zpid                                                     718 non-null    int64  
 2   price                                                    718 non-null    object 
 3   priceLabel                                               718 non-null    object 
 4   beds                                                     714 non-null    float64
 5   baths                                                    718 non-null    float64
 6   area                                                     715 non-null    float64
 7   statusType                                               718 non-null    object 
 8   statusText                                               718 non-null    object 
 9   isFavorite                                               718 non-null    bool   
 10  isUserClaimingOwner                                      718 non-null    bool   
 11  isUserConfirmedClaim                                     718 non-null    bool   
 12  imgSrc                                                   718 non-null    object 
 13  hasImage                                                 718 non-null    bool   
 14  visited                                                  718 non-null    bool   
 15  shouldShowZestimateAsPrice                               718 non-null    bool   
 16  detailUrl                                                718 non-null    object 
 17  pgapt                                                    718 non-null    object 
 18  sgapt                                                    718 non-null    object 
 19  has3DModel                                               718 non-null    bool   
 20  hasVideo                                                 718 non-null    bool   
 21  isHomeRec                                                718 non-null    bool   
 22  address                                                  718 non-null    object 
 23  info3String                                              708 non-null    object 
 24  info1String                                              708 non-null    object 
 25  brokerName                                               708 non-null    object 
 26  hasAdditionalAttributions                                718 non-null    bool   
 27  isFeaturedListing                                        718 non-null    bool   
 28  isShowcaseListing                                        718 non-null    bool   
 29  timeOnZillow                                             718 non-null    int64  
 30  latLong_latitude                                         718 non-null    float64
 31  latLong_longitude                                        718 non-null    float64
 32  hdpData_homeInfo_zpid                                    718 non-null    int64  
 33  hdpData_homeInfo_zipcode                                 718 non-null    int64  
 34  hdpData_homeInfo_city                                    718 non-null    object 
 35  hdpData_homeInfo_state                                   718 non-null    object 
 36  hdpData_homeInfo_latitude                                718 non-null    float64
 37  hdpData_homeInfo_longitude                               718 non-null    float64
 38  hdpData_homeInfo_price                                   718 non-null    float64
 39  hdpData_homeInfo_bathrooms                               718 non-null    float64
 40  hdpData_homeInfo_bedrooms                                714 non-null    float64
 41  hdpData_homeInfo_livingArea                              715 non-null    float64
 42  hdpData_homeInfo_homeType                                718 non-null    object 
 43  hdpData_homeInfo_homeStatus                              718 non-null    object 
 44  hdpData_homeInfo_daysOnZillow                            718 non-null    int64  
 45  hdpData_homeInfo_isFeatured                              718 non-null    bool   
 46  hdpData_homeInfo_shouldHighlight                         718 non-null    bool   
 47  hdpData_homeInfo_zestimate                               608 non-null    float64
 48  hdpData_homeInfo_rentZestimate                           642 non-null    float64
 49  hdpData_homeInfo_listing_sub_type_is_FSBA                664 non-null    object 
 50  hdpData_homeInfo_isUnmappable                            718 non-null    bool   
 51  hdpData_homeInfo_isPreforeclosureAuction                 718 non-null    bool   
 52  hdpData_homeInfo_homeStatusForHDP                        718 non-null    object 
 53  hdpData_homeInfo_priceForHDP                             718 non-null    float64
 54  hdpData_homeInfo_isNonOwnerOccupied                      718 non-null    bool   
 55  hdpData_homeInfo_isPremierBuilder                        718 non-null    bool   
 56  hdpData_homeInfo_isZillowOwned                           718 non-null    bool   
 57  hdpData_homeInfo_currency                                718 non-null    object 
 58  hdpData_homeInfo_country                                 718 non-null    object 
 59  hdpData_homeInfo_taxAssessedValue                        703 non-null    float64
 60  hdpData_homeInfo_lotAreaValue                            714 non-null    float64
 61  hdpData_homeInfo_lotAreaUnit                             714 non-null    object 
 62  schools_rating                                           681 non-null    float64
 63  school_district                                          681 non-null    object 
 64  year_built                                               639 non-null    float64
 65  year_renovated                                           557 non-null    float64
 66  sq_ft_finished                                           637 non-null    float64
 67  total_sq_ft                                              639 non-null    float64
 68  lot_sq_ft                                                673 non-null    float64
 69  taxable_land_value                                       671 non-null    float64
 70  taxable_improvement_value                                641 non-null    float64
 71  county_name                                              681 non-null    object 
 72  bathroom_information_FULL_BATHS                          559 non-null    float64
 73  heating_&_cooling_HEATING_TYPE_CODE                      625 non-null    object 
 74  exterior_information_BUILDING_QUALITY_CODE               631 non-null    object 
 75  exterior_information_BUILDING_CODE                       545 non-null    object 
 76  property_information_SUBDIVISION_NAME                    638 non-null    object 
 77  property_information_LIVING_SQUARE_FEET                  635 non-null    object 
 78  property_information_GROUND_FLOOR_SQUARE_FEET            623 non-null    object 
 79  property_information_BUILDING_SQUARE_FEET                637 non-null    object 
 80  property_information_LEGAL_DESCRIPTION                   671 non-null    object 
 81  parking_&_garage_information_PARKING_TYPE                599 non-null    object 
 82  parking_&_garage_information_GARAGE_PARKING_SQUARE_FEET  574 non-null    object 
 83  parking_&_garage_information_GARAGE_CODE                 578 non-null    object 
 84  lot_information_NUMBER_OF_BUILDINGS                      638 non-null    float64
 85  lot_information_LAND_SQUARE_FOOTAGE                      671 non-null    object 
 86  lot_information_LEGAL_LOT_NUMBER                         629 non-null    object 
 87  lot_information_ACRES                                    671 non-null    float64
 88  assessor_information_ASSESSED_YEAR                       671 non-null    float64
dtypes: bool(19), float64(28), int64(6), object(36)
memory usage: 406.1+ KB

3.4.2 Filtering Price Rows

Next we will drop rows where the price does not exist:

df = df.dropna(subset='hdpData_homeInfo_price')

3.5 Notable Columns

As our dataset has a high number of features we will focus our efforts on the columns that we expect will provide the highest value to our goal of accurate price prediction.

3.5.1 Target Column, hdpData_homeInfo_price

The target column hdpData_homeInfo_price contains real estate listing prices in US dollars ($), stored as floating point numbers. If the real estate market is operating efficiently, this number will capture the value of all features of the listing. Zillow is the source for this column.

df["hdpData_homeInfo_price"].info()
<class 'pandas.core.series.Series'>
RangeIndex: 718 entries, 0 to 717
Series name: hdpData_homeInfo_price
Non-Null Count  Dtype  
--------------  -----  
718 non-null    float64
dtypes: float64(1)
memory usage: 5.7 KB
df['hdpData_homeInfo_price'].describe().apply(lambda x: format(x, 'f'))
count         718.000000
mean      1252138.281337
std       1603089.002523
min             0.000000
25%        649900.000000
50%        840000.000000
75%       1275000.000000
max      23799000.000000
Name: hdpData_homeInfo_price, dtype: object

As we can see from the code sections above, hdpData_homeInfo_price has 718 values that range from $0 to $23,799,000, with a mean of ~$1,252,000 and a median of $840,000. We will explore this column more completely when we clean it in Section 5.2.1.

We do see that the minimum price is $0, which most likely represents a missing value. Let's note this in the filter parameters by setting a minimum price. This value is a best guess based on local knowledge:

filter_params["MINIMUM_PRICE"] = 250_000

3.5.2 Feature Column, hdpData_homeInfo_livingArea

The feature column hdpData_homeInfo_livingArea contains the living area of the listing in square feet (\(\text{ft}^2\)), stored as a float64. The source of this column is Zillow. This measurement typically does not include the garage or unfinished space, so it can effectively be interpreted as the interior size of the listing. There is no guarantee that this measurement is correct, and the seller may have an incentive to inflate it in an attempt to get a higher price.

df['hdpData_homeInfo_livingArea'].info()
<class 'pandas.core.series.Series'>
RangeIndex: 718 entries, 0 to 717
Series name: hdpData_homeInfo_livingArea
Non-Null Count  Dtype  
--------------  -----  
715 non-null    float64
dtypes: float64(1)
memory usage: 5.7 KB
df['hdpData_homeInfo_livingArea'].describe().apply(lambda x: format(x, 'f'))
count      715.000000
mean      3173.609790
std       2624.123234
min        522.000000
25%       1884.500000
50%       2704.000000
75%       3794.500000
max      50275.000000
Name: hdpData_homeInfo_livingArea, dtype: object

As we can see from the code sections above, hdpData_homeInfo_livingArea has 715 values that range from 522 to 50,275, with a mean of 3,173 and a median of 2,704. We will explore this column more completely when we clean it in Section 5.2.3.

We do see an extreme value (~50,000 \(\text{ft}^2\)) for the maximum square footage, which should be added to the filter. This value is most likely an error in data entry or listing categorization:

filter_params["MAX_SQFT"] = 10000

3.5.3 Feature Column, schools_rating

The feature column schools_rating is the GreatSchools Rating of schools in the area. We include this value because the quality of local schools may have an effect on overall price.

df['schools_rating'].info()
<class 'pandas.core.series.Series'>
RangeIndex: 718 entries, 0 to 717
Series name: schools_rating
Non-Null Count  Dtype  
--------------  -----  
681 non-null    float64
dtypes: float64(1)
memory usage: 5.7 KB
df['schools_rating'].describe().apply(lambda x: format(x, 'f'))
count    681.000000
mean       6.225110
std        1.626391
min        2.400000
25%        4.700000
50%        6.500000
75%        7.300000
max        9.400000
Name: schools_rating, dtype: object

As we can see from the code sections above, schools_rating has 681 values that range from 2.4 to 9.4, with a mean of 6.22 and a median of 6.5.

This column appears to contain good quality data, but will require imputation to fill in missing values. We will add this to our filter parameters:

filter_params["IMPUTE_COLS"] = ['schools_rating']

3.5.4 Feature Column, exterior_information_BUILDING_QUALITY_CODE

The feature column exterior_information_BUILDING_QUALITY_CODE is a categorical value from Redfin that contains string ratings of building quality. This value lives deep within the Redfin API and no further description is available.

df['exterior_information_BUILDING_QUALITY_CODE'].info()
<class 'pandas.core.series.Series'>
RangeIndex: 718 entries, 0 to 717
Series name: exterior_information_BUILDING_QUALITY_CODE
Non-Null Count  Dtype 
--------------  ----- 
631 non-null    object
dtypes: object(1)
memory usage: 5.7+ KB
df['exterior_information_BUILDING_QUALITY_CODE'].value_counts()
exterior_information_BUILDING_QUALITY_CODE
Average          331
Good             186
Fair              67
Above Average     32
Excellent         12
Poor               3
Name: count, dtype: int64

As we can see from the code sections above, exterior_information_BUILDING_QUALITY_CODE has 631 non-null values that range from “Poor” to “Excellent”; the remaining 87 listings are missing a rating.

As this is a categorical value it will need to be encoded for further processing. We will add this note to the filter parameters:

filter_params["ENCODE_COLS"] = ['exterior_information_BUILDING_QUALITY_CODE']

4 Data Cleaning

Prior to constructing our ML models, the listing data must be cleaned. The goal of this section is to produce a single dataset that is formatted and ready for ML model construction. This process should build a solid foundation for building different types of models. As a reminder, we have already removed columns with large numbers of null values (Section 3.4.1) and removed rows with null prices (Section 3.4.2).

The data cleaning process will require the following steps:

  • Eliminate columns that will not be utilized, or alternatively, initially select feature columns
  • Simplify column names
  • Check for duplicate columns
  • Encode columns that contain categorical values

As a reminder we will print out the filter_params dict:

filter_params
{'NULL_MAX_PERCENT': 0.25,
 'MINIMUM_PRICE': 250000,
 'MAX_SQFT': 10000,
 'IMPUTE_COLS': ['schools_rating'],
 'ENCODE_COLS': ['exterior_information_BUILDING_QUALITY_CODE']}

4.1 DataFrame Setup

df.index = df['zpid']
len(df.columns)
89
df.columns
Index(['Unnamed: 0', 'zpid', 'price', 'priceLabel', 'beds', 'baths', 'area',
       'statusType', 'statusText', 'isFavorite', 'isUserClaimingOwner',
       'isUserConfirmedClaim', 'imgSrc', 'hasImage', 'visited',
       'shouldShowZestimateAsPrice', 'detailUrl', 'pgapt', 'sgapt',
       'has3DModel', 'hasVideo', 'isHomeRec', 'address', 'info3String',
       'info1String', 'brokerName', 'hasAdditionalAttributions',
       'isFeaturedListing', 'isShowcaseListing', 'timeOnZillow',
       'latLong_latitude', 'latLong_longitude', 'hdpData_homeInfo_zpid',
       'hdpData_homeInfo_zipcode', 'hdpData_homeInfo_city',
       'hdpData_homeInfo_state', 'hdpData_homeInfo_latitude',
       'hdpData_homeInfo_longitude', 'hdpData_homeInfo_price',
       'hdpData_homeInfo_bathrooms', 'hdpData_homeInfo_bedrooms',
       'hdpData_homeInfo_livingArea', 'hdpData_homeInfo_homeType',
       'hdpData_homeInfo_homeStatus', 'hdpData_homeInfo_daysOnZillow',
       'hdpData_homeInfo_isFeatured', 'hdpData_homeInfo_shouldHighlight',
       'hdpData_homeInfo_zestimate', 'hdpData_homeInfo_rentZestimate',
       'hdpData_homeInfo_listing_sub_type_is_FSBA',
       'hdpData_homeInfo_isUnmappable',
       'hdpData_homeInfo_isPreforeclosureAuction',
       'hdpData_homeInfo_homeStatusForHDP', 'hdpData_homeInfo_priceForHDP',
       'hdpData_homeInfo_isNonOwnerOccupied',
       'hdpData_homeInfo_isPremierBuilder', 'hdpData_homeInfo_isZillowOwned',
       'hdpData_homeInfo_currency', 'hdpData_homeInfo_country',
       'hdpData_homeInfo_taxAssessedValue', 'hdpData_homeInfo_lotAreaValue',
       'hdpData_homeInfo_lotAreaUnit', 'schools_rating', 'school_district',
       'year_built', 'year_renovated', 'sq_ft_finished', 'total_sq_ft',
       'lot_sq_ft', 'taxable_land_value', 'taxable_improvement_value',
       'county_name', 'bathroom_information_FULL_BATHS',
       'heating_&_cooling_HEATING_TYPE_CODE',
       'exterior_information_BUILDING_QUALITY_CODE',
       'exterior_information_BUILDING_CODE',
       'property_information_SUBDIVISION_NAME',
       'property_information_LIVING_SQUARE_FEET',
       'property_information_GROUND_FLOOR_SQUARE_FEET',
       'property_information_BUILDING_SQUARE_FEET',
       'property_information_LEGAL_DESCRIPTION',
       'parking_&_garage_information_PARKING_TYPE',
       'parking_&_garage_information_GARAGE_PARKING_SQUARE_FEET',
       'parking_&_garage_information_GARAGE_CODE',
       'lot_information_NUMBER_OF_BUILDINGS',
       'lot_information_LAND_SQUARE_FOOTAGE',
       'lot_information_LEGAL_LOT_NUMBER', 'lot_information_ACRES',
       'assessor_information_ASSESSED_YEAR'],
      dtype='object')

4.2 Down Selecting Feature Columns

In this section we perform an initial selection of features. The intent is to remove features that will not be utilized.

original_df = df.copy()

df = df[
    [
        "beds",
        "baths",
        "latLong_latitude",
        "latLong_longitude",
        "hdpData_homeInfo_price",
        "hdpData_homeInfo_livingArea",
        "hdpData_homeInfo_zestimate",
        "hdpData_homeInfo_taxAssessedValue",
        "schools_rating",
        "year_built",
        "year_renovated",
        "sq_ft_finished",
        "total_sq_ft",
        "lot_sq_ft",
        "taxable_land_value",
        "taxable_improvement_value",
        "heating_&_cooling_HEATING_TYPE_CODE",
        "exterior_information_BUILDING_QUALITY_CODE",
        "property_information_SUBDIVISION_NAME",
        "property_information_GROUND_FLOOR_SQUARE_FEET",
        "property_information_BUILDING_SQUARE_FEET",
        "property_information_LEGAL_DESCRIPTION",
        "parking_&_garage_information_PARKING_TYPE",
        "parking_&_garage_information_GARAGE_PARKING_SQUARE_FEET",
        "parking_&_garage_information_GARAGE_CODE",
        "lot_information_LAND_SQUARE_FOOTAGE",
    ]
]
len(df.columns)
26

4.3 Simplifying Column Names

prefixes_to_replace = [
    "hdpData_homeInfo_",
    "latLong_",
    "heating_&_cooling_",
    "exterior_information_",
    "property_information_",
    "parking_&_garage_information_",
    "lot_information",
]

for prefix in prefixes_to_replace:
    df.columns = df.columns.str.replace(prefix, "")

def camel_case(input_string):
    """
    Convert a string to camel case.

    Parameters
    ----------
    input_string : str
        The input string to be converted.

    Returns
    -------
    str
        The converted string in camel case.

    Notes
    -----
    If the input string does not contain any underscores, it is returned as is.

    The conversion process involves converting the input string to title case and removing non-alphanumeric characters.
    Then, the first character is converted to lowercase.

    Examples
    --------
    >>> camel_case("hello_world")
    'helloWorld'

    >>> camel_case("hello123_world")
    'hello123World'
    """
    if input_string.count('_') == 0:
        return input_string

    # Convert the input string to title case and remove non-alphanumeric characters
    output = ''.join(x for x in input_string.title() if x.isalnum())

    # Convert the first character to lowercase
    return output[0].lower() + output[1:]

df.columns = [camel_case(x) for x in df.columns]
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 718 entries, 13189295 to 13803132
Data columns (total 26 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   beds                     714 non-null    float64
 1   baths                    718 non-null    float64
 2   latitude                 718 non-null    float64
 3   longitude                718 non-null    float64
 4   price                    718 non-null    float64
 5   livingArea               715 non-null    float64
 6   zestimate                608 non-null    float64
 7   taxAssessedValue         703 non-null    float64
 8   schoolsRating            681 non-null    float64
 9   yearBuilt                639 non-null    float64
 10  yearRenovated            557 non-null    float64
 11  sqFtFinished             637 non-null    float64
 12  totalSqFt                639 non-null    float64
 13  lotSqFt                  673 non-null    float64
 14  taxableLandValue         671 non-null    float64
 15  taxableImprovementValue  641 non-null    float64
 16  heatingTypeCode          625 non-null    object 
 17  buildingQualityCode      631 non-null    object 
 18  subdivisionName          638 non-null    object 
 19  groundFloorSquareFeet    623 non-null    object 
 20  buildingSquareFeet       637 non-null    object 
 21  legalDescription         671 non-null    object 
 22  parkingType              599 non-null    object 
 23  garageParkingSquareFeet  574 non-null    object 
 24  garageCode               578 non-null    object 
 25  landSquareFootage        671 non-null    object 
dtypes: float64(16), object(10)
memory usage: 151.5+ KB

4.4 Column Conversion

There are a few columns that appear numeric but are stored as object. Here we will attempt to convert their types; if a conversion fails we will drop the column:

columns_to_convert_to_numeric = ["groundFloorSquareFeet", "buildingSquareFeet",
"garageParkingSquareFeet", "landSquareFootage"]

for col in columns_to_convert_to_numeric:
    df[col] = pd.to_numeric(df[col].str.replace(",", ""))

df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 718 entries, 13189295 to 13803132
Data columns (total 26 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   beds                     714 non-null    float64
 1   baths                    718 non-null    float64
 2   latitude                 718 non-null    float64
 3   longitude                718 non-null    float64
 4   price                    718 non-null    float64
 5   livingArea               715 non-null    float64
 6   zestimate                608 non-null    float64
 7   taxAssessedValue         703 non-null    float64
 8   schoolsRating            681 non-null    float64
 9   yearBuilt                639 non-null    float64
 10  yearRenovated            557 non-null    float64
 11  sqFtFinished             637 non-null    float64
 12  totalSqFt                639 non-null    float64
 13  lotSqFt                  673 non-null    float64
 14  taxableLandValue         671 non-null    float64
 15  taxableImprovementValue  641 non-null    float64
 16  heatingTypeCode          625 non-null    object 
 17  buildingQualityCode      631 non-null    object 
 18  subdivisionName          638 non-null    object 
 19  groundFloorSquareFeet    623 non-null    float64
 20  buildingSquareFeet       637 non-null    float64
 21  legalDescription         671 non-null    object 
 22  parkingType              599 non-null    object 
 23  garageParkingSquareFeet  574 non-null    float64
 24  garageCode               578 non-null    object 
 25  landSquareFootage        671 non-null    float64
dtypes: float64(20), object(6)
memory usage: 151.5+ KB
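The comma-stripping conversion can be illustrated on a toy series (a hedged sketch; the values below are ours, not the listing data):

```python
import pandas as pd

# Toy illustration, not the listing data: square-footage strings with
# thousands separators convert cleanly once the commas are removed.
raw = pd.Series(["1,234", "567", "10,000"])
converted = pd.to_numeric(raw.str.replace(",", ""))
print(converted.tolist())  # [1234, 567, 10000]
```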

4.5 Checking for Collinearity

To check for effectively duplicate columns we will set a threshold, compute each numeric column's correlation with every other column, and save the column pairs whose correlation exceeds the threshold:

def calc_collinearity(input_df, correlation_threshold=0.9):
    """
    Calculate collinearity between columns in a DataFrame.

    Parameters:
    input_df (pandas.DataFrame): The input DataFrame.
    correlation_threshold (float, optional): The threshold value for determining correlation. Default is 0.9.

    Returns:
    list: A list of correlated column pairs.

    Example:
    >>> input_df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
    >>> calc_collinearity(input_df)
    [['A', 'B']]
    """

    correlated_pairs = []

    for col in input_df.columns:
        if input_df[col].dtype == "float64":
            corr = input_df.corrwith(input_df[col], numeric_only=True).sort_values(ascending=False)[1:]

            for key, val in corr.to_dict().items():
                if val > correlation_threshold:
                    correlated_pairs.append([col, key])

    return correlated_pairs

print(calc_collinearity(df))
[['price', 'zestimate'], ['livingArea', 'buildingSquareFeet'], ['zestimate', 'price'], ['sqFtFinished', 'totalSqFt'], ['totalSqFt', 'sqFtFinished'], ['lotSqFt', 'lotSqFt'], ['buildingSquareFeet', 'livingArea'], ['landSquareFootage', 'lotSqFt']]
test_df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
print(calc_collinearity(test_df))
[]

From this analysis we can see that several features are collinear. We are going to compare our models against both price and the zestimate (Zillow's price estimate), so we will keep these. In the following code section we remove one column from each remaining collinear pair:

df = df.drop(['sqFtFinished', 'buildingSquareFeet', "landSquareFootage"], axis=1)
calc_collinearity(df)
[['price', 'zestimate'], ['zestimate', 'price']]
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 718 entries, 13189295 to 13803132
Data columns (total 23 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   beds                     714 non-null    float64
 1   baths                    718 non-null    float64
 2   latitude                 718 non-null    float64
 3   longitude                718 non-null    float64
 4   price                    718 non-null    float64
 5   livingArea               715 non-null    float64
 6   zestimate                608 non-null    float64
 7   taxAssessedValue         703 non-null    float64
 8   schoolsRating            681 non-null    float64
 9   yearBuilt                639 non-null    float64
 10  yearRenovated            557 non-null    float64
 11  totalSqFt                639 non-null    float64
 12  lotSqFt                  673 non-null    float64
 13  taxableLandValue         671 non-null    float64
 14  taxableImprovementValue  641 non-null    float64
 15  heatingTypeCode          625 non-null    object 
 16  buildingQualityCode      631 non-null    object 
 17  subdivisionName          638 non-null    object 
 18  groundFloorSquareFeet    623 non-null    float64
 19  legalDescription         671 non-null    object 
 20  parkingType              599 non-null    object 
 21  garageParkingSquareFeet  574 non-null    float64
 22  garageCode               578 non-null    object 
dtypes: float64(17), object(6)
memory usage: 134.6+ KB

4.6 Filtering Categorical Columns

Prior to encoding categorical columns we would like to check their contents and quality, preferring columns with a small number of distinct values. Initially we will print the values of each column:

for col in df.columns:
    if df[col].dtype != "float64":
        print(col)
        print(df[col].value_counts())
        print()
heatingTypeCode
heatingTypeCode
Forced Air                      450
Warm Air                         81
Hot Water                        60
Wall Furnace                      6
Baseboard                         6
Forced Air Gas                    5
Radiant                           4
Electric                          3
Baseboard Hot Water               3
Gravity                           2
Furnace                           2
Floor/wall Furnace                1
Baseboard Electric                1
Baseboard Electric/Hot Water      1
Name: count, dtype: int64

buildingQualityCode
buildingQualityCode
Average          331
Good             186
Fair              67
Above Average     32
Excellent         12
Poor               3
Name: count, dtype: int64

subdivisionName
subdivisionName
HASKINS STATION                8
CANDELAS FLG 1                 8
LEYDEN ROCK SUB FLG 6          7
P T BARNUMS SUB                6
BELLEVIEW & SIMMS FLG 1        5
                              ..
BUSCARELLO LT LINE ADJ         1
LINDA VISTA ACRES              1
CANDELAS FLG 2 AMD 1           1
APPLEWOOD GROVE 1ST FLG        1
TROUTDALE IN THE PINES         1
Name: count, Length: 491, dtype: int64

legalDescription
legalDescription
LOT 8 BLK 6 KEEWAYDIN                                                                                                                                  1
SECTION 13 TOWNSHIP 04 RANGE 69 QTR SE SUBDIVISIONCD 396200 SUBDIVISIONNAME KELTON HEIGHTS BLOCK 037 LOT SIZE: 36994 TRACT VALUE: .849                 1
P T BARNUMS SUB B98 L38 TO 40 INC                                                                                                                      1
SECTION 17 TOWNSHIP 04 RANGE 69 SUBDIVISIONCD 308800 SUBDIVISIONNAME GREEN MOUNTAIN VILLAGE FLG # 2 BLOCK 011 LOT 0004 SIZE: 9990 TRACT VALUE: .229    1
SUB:BERKELEY GARDENS BLK:7 DESC: LOTS 25 TO 29 INC                                                                                                     1
                                                                                                                                                      ..
SECTION 34 TOWNSHIP 03 RANGE 69 QTR NE SUBDIVISIONCD 394800 SUBDIVISIONNAME KAWANEE GARDENS BLOCK 007 LOT 0006 SIZE: 17194 TRACT 00A VALUE: .395       1
SECTION 25 TOWNSHIP 03 RANGE 69 SUBDIVISIONCD 102705 SUBDIVISIONNAME BUSCARELLO LOT LINE ADJ LOT 0002 SIZE: 5815 VALUE: .1335                          1
SECTION 28 TOWNSHIP 02 RANGE 69 QTR NW SUBDIVISIONCD 435900 SUBDIVISIONNAME LAKECREST FLG #3 BLOCK 001 LOT 0001 SIZE: 9583 TRACT VALUE: .220           1
SECTION 01 TOWNSHIP 03 RANGE 69 QTR NW SUBDIVISIONCD 454800 SUBDIVISIONNAME LAMAR HEIGHTS FLG # 8 BLOCK LOT 0190 SIZE: 7618 TRACT VALUE: .175          1
SECTION 10 TOWNSHIP 05 RANGE 71 QTR NE SIZE: 21126 VALUE: .485 KEY=107                                                                                 1
Name: count, Length: 671, dtype: int64

parkingType
parkingType
Attached Garage            428
Detached Garage             62
Detached Frame Garage       24
Basement Garage             22
Detached Masonry Garage     20
Carport                     12
On Street                   11
Built Under Garage          10
On and Off Street           10
Name: count, dtype: int64

garageCode
garageCode
Attached            428
Detached             62
Detached Frame       24
Basement             22
Detached Masonry     20
Carport              12
Built Under          10
Name: count, dtype: int64

Based on the output we will eliminate legalDescription, which contains all unique values, and parkingType, which largely duplicates the information in garageCode.

df = df.drop(['legalDescription', 'parkingType'], axis=1)
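The by-eye uniqueness check above can also be automated. A minimal sketch, using a hypothetical helper of our own (not part of the report's code):

```python
import pandas as pd

# Hypothetical helper, not part of the report's code: flag object columns
# whose values are nearly all unique -- these carry little signal once encoded.
def find_high_cardinality(input_df, unique_ratio_threshold=0.9):
    """Return object columns where the share of unique values exceeds the threshold."""
    flagged = []
    for col in input_df.select_dtypes(include="object").columns:
        ratio = input_df[col].nunique() / len(input_df)
        if ratio > unique_ratio_threshold:
            flagged.append(col)
    return flagged

# Toy data: 'legal' is entirely unique, 'quality' repeats.
toy = pd.DataFrame({
    "legal": ["LOT 1", "LOT 2", "LOT 3", "LOT 4"],
    "quality": ["Good", "Good", "Fair", "Good"],
})
print(find_high_cardinality(toy))  # ['legal']
```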

subdivisionName requires further analysis:

df['subdivisionName'].value_counts()
subdivisionName
HASKINS STATION                8
CANDELAS FLG 1                 8
LEYDEN ROCK SUB FLG 6          7
P T BARNUMS SUB                6
BELLEVIEW & SIMMS FLG 1        5
                              ..
BUSCARELLO LT LINE ADJ         1
LINDA VISTA ACRES              1
CANDELAS FLG 2 AMD 1           1
APPLEWOOD GROVE 1ST FLG        1
TROUTDALE IN THE PINES         1
Name: count, Length: 491, dtype: int64

Based on this output, subdivisionName has too many distinct values (491) for conversion to a category, so we will remove it as well:

df = df.drop(['subdivisionName'], axis=1)

We now have 3 categorical columns that can be encoded.

4.7 Filtering Unrealistic Prices

Next we will remove listings where the price is unrealistically low, based on the analysis in Section 3.5.1:

df = df[df['price'] > filter_params["MINIMUM_PRICE"]]

4.8 Filtering Square Footage Outliers

Next we will remove listings that are larger than expected. This was first analyzed in Section 3.5.2:

df = df[df['livingArea'] < filter_params["MAX_SQFT"]]

Additionally we will filter out properties on lots larger than 10 acres (at 43,560 \(\text{ft}^2\) per acre). These are typically farm or agricultural properties and are not needed in our pricing model.

filter_params['MAX_LOT_SQFT'] = 10 * 43560
df = df[df["lotSqFt"] < filter_params["MAX_LOT_SQFT"]]

4.9 Correlation Visualization

To continue to understand the data we will visualize the relationship between columns:

df_float = df.select_dtypes(include=['float64'])


def plot_correlation(input_df, title, annot=True, tick_rot=0, width=12, height=8):
    """
    This function plots a correlation heatmap for a given input dataframe.

    Args:
        input_df (pandas.DataFrame): The input dataframe containing the data for which the correlation heatmap needs to be plotted.
        title (str): The title for the correlation heatmap plot.
        annot (bool, optional): Whether to annotate the heatmap cells with the correlation values. Defaults to True.
        tick_rot (int, optional): The rotation angle for the x-axis tick labels. Defaults to 0.
        width (int, optional): The width of the correlation heatmap plot. Defaults to 12.
        height (int, optional): The height of the correlation heatmap plot. Defaults to 8.

    Returns:
        None
    """
    # Calculate the correlation matrix
    corr = input_df.corr(numeric_only=True)

    fig, ax = plt.subplots(figsize=(width, height))

    sns.heatmap(
        corr,
        cmap="vlag",
        annot=annot,
        xticklabels=corr.columns,
        yticklabels=corr.columns,
        vmin=-1.0,
        vmax=1.0,
        fmt=".2f",
        cbar=False,
    ).set(title=title)

    # Move x-axis ticks to the top
    ax.xaxis.tick_top()
    # Remove tick marks
    ax.tick_params(length=0)

    plt.xticks(rotation=tick_rot)
    plt.yticks(rotation=0)

    brand_plot()

    plt.show()
plot_correlation(df, "Initial Correlation Matrix", annot=False, tick_rot=45)

Figure 3: Initial Correlation Matrix of Listings DataFrame

In Figure 3 we see a few strong positive correlations and a large number of weak correlations. Overall this visualization reinforces that our data is ready for further work.

4.10 One Hot Encoding (OHE)

One Hot Encoding (OHE) is the process of converting a categorical value into a set of binary indicator columns, one per category, with a 1 in the column corresponding to that row's value. Since we pass drop_first=True, one category in each set is dropped to avoid creating perfectly collinear columns.
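As a toy illustration (the quality series below is ours, not the listing data), a three-level category becomes two indicator columns under drop_first=True:

```python
import pandas as pd

# Toy illustration, not the listing data: a categorical column with three
# levels ("Poor", "Average", "Good") becomes two indicator columns because
# drop_first=True drops the first (alphabetically sorted) level.
quality = pd.Series(["Poor", "Average", "Good", "Average"], name="quality")
encoded = pd.get_dummies(quality, prefix="quality", drop_first=True)
print(encoded.columns.tolist())  # ['quality_Good', 'quality_Poor']
```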

Prior to this conversion it will be helpful to save the column names that are not one hot encoded:

float_cols = [col for col in df.columns if df[col].dtype == 'float64']

In the code below we use pandas get_dummies function to OHE our categorical features:

for col in df.columns:
    if df[col].dtype != "float64":
        one_hot = pd.get_dummies(df[col], prefix=col, drop_first=True)
        df = df.drop(col, axis=1)
        df = pd.concat([df, one_hot], axis=1)

df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 658 entries, 13189295 to 13803132
Data columns (total 41 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   beds                                          656 non-null    float64
 1   baths                                         658 non-null    float64
 2   latitude                                      658 non-null    float64
 3   longitude                                     658 non-null    float64
 4   price                                         658 non-null    float64
 5   livingArea                                    658 non-null    float64
 6   zestimate                                     565 non-null    float64
 7   taxAssessedValue                              652 non-null    float64
 8   schoolsRating                                 658 non-null    float64
 9   yearBuilt                                     625 non-null    float64
 10  yearRenovated                                 544 non-null    float64
 11  totalSqFt                                     624 non-null    float64
 12  lotSqFt                                       658 non-null    float64
 13  taxableLandValue                              656 non-null    float64
 14  taxableImprovementValue                       626 non-null    float64
 15  groundFloorSquareFeet                         610 non-null    float64
 16  garageParkingSquareFeet                       560 non-null    float64
 17  heatingTypeCode_Baseboard Electric            658 non-null    bool   
 18  heatingTypeCode_Baseboard Electric/Hot Water  658 non-null    bool   
 19  heatingTypeCode_Baseboard Hot Water           658 non-null    bool   
 20  heatingTypeCode_Electric                      658 non-null    bool   
 21  heatingTypeCode_Floor/wall Furnace            658 non-null    bool   
 22  heatingTypeCode_Forced Air                    658 non-null    bool   
 23  heatingTypeCode_Forced Air Gas                658 non-null    bool   
 24  heatingTypeCode_Furnace                       658 non-null    bool   
 25  heatingTypeCode_Gravity                       658 non-null    bool   
 26  heatingTypeCode_Hot Water                     658 non-null    bool   
 27  heatingTypeCode_Radiant                       658 non-null    bool   
 28  heatingTypeCode_Wall Furnace                  658 non-null    bool   
 29  heatingTypeCode_Warm Air                      658 non-null    bool   
 30  buildingQualityCode_Average                   658 non-null    bool   
 31  buildingQualityCode_Excellent                 658 non-null    bool   
 32  buildingQualityCode_Fair                      658 non-null    bool   
 33  buildingQualityCode_Good                      658 non-null    bool   
 34  buildingQualityCode_Poor                      658 non-null    bool   
 35  garageCode_Basement                           658 non-null    bool   
 36  garageCode_Built Under                        658 non-null    bool   
 37  garageCode_Carport                            658 non-null    bool   
 38  garageCode_Detached                           658 non-null    bool   
 39  garageCode_Detached Frame                     658 non-null    bool   
 40  garageCode_Detached Masonry                   658 non-null    bool   
dtypes: bool(24), float64(17)
memory usage: 108.0 KB
plot_correlation(df, "OHE Listing Correlation Matrix", annot=False, tick_rot=90, width=16, height=10)

Figure 4: OHE Correlation Matrix of Listings DataFrame

Figure 4 includes the correlation relationships of the OHE values. We speculate that these features may add a small amount of value to our standard regression models, and may be moderately beneficial for our random forest models.

4.11 Imputation

To ensure a fair comparison between models, we will pass identical data to each model. Some models have distinct requirements for their input values, and some cannot handle missing values. This requires that an imputation be performed on the data. For this we will use the KNNImputer from Pedregosa et al. (2011).

def impute_df(input_df):
    """
    Imputes missing values in a DataFrame using KNNImputer.

    Args:
        input_df (pandas.DataFrame): The DataFrame to impute.

    Returns:
        pandas.DataFrame: The DataFrame with imputed values.
    """
    imputer = KNNImputer()
    columns = input_df.columns
    input_df = pd.DataFrame(imputer.fit_transform(input_df))
    input_df.columns = columns
    return input_df


df = impute_df(df)

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 658 entries, 0 to 657
Data columns (total 41 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   beds                                          658 non-null    float64
 1   baths                                         658 non-null    float64
 2   latitude                                      658 non-null    float64
 3   longitude                                     658 non-null    float64
 4   price                                         658 non-null    float64
 5   livingArea                                    658 non-null    float64
 6   zestimate                                     658 non-null    float64
 7   taxAssessedValue                              658 non-null    float64
 8   schoolsRating                                 658 non-null    float64
 9   yearBuilt                                     658 non-null    float64
 10  yearRenovated                                 658 non-null    float64
 11  totalSqFt                                     658 non-null    float64
 12  lotSqFt                                       658 non-null    float64
 13  taxableLandValue                              658 non-null    float64
 14  taxableImprovementValue                       658 non-null    float64
 15  groundFloorSquareFeet                         658 non-null    float64
 16  garageParkingSquareFeet                       658 non-null    float64
 17  heatingTypeCode_Baseboard Electric            658 non-null    float64
 18  heatingTypeCode_Baseboard Electric/Hot Water  658 non-null    float64
 19  heatingTypeCode_Baseboard Hot Water           658 non-null    float64
 20  heatingTypeCode_Electric                      658 non-null    float64
 21  heatingTypeCode_Floor/wall Furnace            658 non-null    float64
 22  heatingTypeCode_Forced Air                    658 non-null    float64
 23  heatingTypeCode_Forced Air Gas                658 non-null    float64
 24  heatingTypeCode_Furnace                       658 non-null    float64
 25  heatingTypeCode_Gravity                       658 non-null    float64
 26  heatingTypeCode_Hot Water                     658 non-null    float64
 27  heatingTypeCode_Radiant                       658 non-null    float64
 28  heatingTypeCode_Wall Furnace                  658 non-null    float64
 29  heatingTypeCode_Warm Air                      658 non-null    float64
 30  buildingQualityCode_Average                   658 non-null    float64
 31  buildingQualityCode_Excellent                 658 non-null    float64
 32  buildingQualityCode_Fair                      658 non-null    float64
 33  buildingQualityCode_Good                      658 non-null    float64
 34  buildingQualityCode_Poor                      658 non-null    float64
 35  garageCode_Basement                           658 non-null    float64
 36  garageCode_Built Under                        658 non-null    float64
 37  garageCode_Carport                            658 non-null    float64
 38  garageCode_Detached                           658 non-null    float64
 39  garageCode_Detached Frame                     658 non-null    float64
 40  garageCode_Detached Masonry                   658 non-null    float64
dtypes: float64(41)
memory usage: 210.9 KB
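To build intuition for what the KNNImputer does, the sketch below hand-rolls a simplified version of the idea: each missing value is filled with the mean of that column among the k nearest rows, where distance is measured over the columns both rows have. This is an illustration only, not scikit-learn's actual implementation (which uses nan-Euclidean distances with a scaling correction):

```python
import math

def knn_impute(rows, k=2):
    """Fill missing values (None) with the mean of that column among the
    k nearest rows, using distance over shared non-missing columns.
    A simplified sketch of the idea behind sklearn's KNNImputer."""
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, v in enumerate(row):
            if v is None:
                # Distance to every other row over columns both rows have
                dists = []
                for m, other in enumerate(rows):
                    if m == i or other[j] is None:
                        continue
                    shared = [(a, b) for a, b in zip(row, other)
                              if a is not None and b is not None]
                    if not shared:
                        continue
                    d = math.sqrt(sum((a - b) ** 2 for a, b in shared))
                    dists.append((d, other[j]))
                dists.sort(key=lambda t: t[0])
                neighbors = [val for _, val in dists[:k]]
                filled[i][j] = sum(neighbors) / len(neighbors)
    return filled

rows = [
    [1.0, 2.0],
    [1.1, None],  # missing second column
    [8.0, 9.0],
]
print(knn_impute(rows, k=1))  # the gap is filled from the nearest row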

4.12 Building Training and Test DataFrames

Now that we have our DataFrame we can split it into training and test sets. Our target is price and all other columns are our features. We use train_test_split to partition the data.

test_size = 0.2
random_state = 42
target = "price"

df = df.sort_index()

y = df[target]
z = df["zestimate"]

lat = df["latitude"]
lng = df["longitude"]

# x = df.drop(["price", "zestimate", "latitude", "longitude"], axis=1)
x = df.drop(["price", "zestimate"], axis=1)

x_focus = df.drop(["price", "zestimate", "latitude", "longitude"], axis=1)

# train_test_split returns a (train, test) pair for each input, in input order
x_train, x_test, y_train, y_test, z_train, z_test, lat_train, lat_test, lng_train, lng_test, all_train, all_test, x_focus_train, x_focus_test = train_test_split(
    x, y, z, lat, lng, df, x_focus, test_size=test_size, random_state=random_state
)

4.13 Preparation for Model Comparison

To compare models we need to compute metrics for comparison. We have chosen to compute the root mean squared error (RMSE) and \(R^2\), and we also record each model's execution time. As we are using the same dataset for all models we can safely use \(R^2\) as a comparison. The code below calculates these values and saves them in model_stats for comparison and visualization.

execution_times = {}

class ModelMetricsHandler:
    """
    A class to store and plot model accuracy statistics.
    """

    def __init__(self):
        self.model_stats = {
            "$R^2$": [],
            "Root Mean Squared Error": [],
            "Execution Time": [],
        }

        self.execution_times = {}

    def calculate(self, name, i_y_test, i_y_pred, start_time):
        """
        Calculate model statistics.

        Parameters
        ----------
        name : str
            The name of the model.
        i_y_test : array-like
            The test data.
        i_y_pred : array-like
            The predicted data.
        start_time : number
            start time from time.time()
        """
        assert len(i_y_test) == len(
            i_y_pred
        ), "Test and prediction array lengths do not match!"

        self.model_stats["Execution Time"].append([name, time.time() - start_time])

        calc_root_mean_squared_error = mean_squared_error(
            i_y_test, i_y_pred, squared=False
        )
        calc_r2_score = r2_score(i_y_test, i_y_pred)

        self.model_stats["Root Mean Squared Error"].append(
            [name, calc_root_mean_squared_error]
        )
        self.model_stats["$R^2$"].append([name, calc_r2_score])

    def plot_all(self, fig_width=12, fig_height=12):
        """
        Plot model statistics.

        Parameters
        ----------
        fig_width : int
            The width of the figure.
        fig_height : int
            The height of the figure.
        """
        num_rows = len(self.model_stats.keys())
        fig_height = num_rows * 6
        num_cols = 1
        fig, axes = plt.subplots(
            nrows=num_rows, ncols=num_cols, figsize=(fig_width, fig_height)
        )

        iteration = 1
        for metric, values in self.model_stats.items():
            labels = [x[0] for x in values]
            values = [x[1] for x in values]
            plt.subplot(num_rows, num_cols, iteration)
            p = plt.bar(labels, values)
            plt.title(f"{metric} by Model")
            if metric == "$R^2$":
                plt.bar_label(p, ["{:.4f}".format(x) for x in values])
            else:
                plt.bar_label(p, ["{:,}".format(int(x)) for x in values])
            plt.xticks(rotation=-45)
            iteration += 1

        brand_plot()
        plt.show()

model_metrics = ModelMetricsHandler()
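As a quick sanity check on these comparison metrics, both RMSE and \(R^2\) can be computed by hand on toy predictions. RMSE reports error in the target's own units (dollars here), while \(R^2\) reports the fraction of variance explained, with 1.0 being perfect. The values below are hypothetical:

```python
# Toy check of the comparison metrics on hypothetical predictions
y_true = [100.0, 200.0, 300.0, 400.0]
y_pred = [110.0, 190.0, 310.0, 390.0]

n = len(y_true)
# RMSE: square root of the mean squared error
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
rmse = (ss_res / n) ** 0.5

# R^2: 1 minus residual sum of squares over total sum of squares
mean_y = sum(y_true) / n
ss_tot = sum((t - mean_y) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

print(rmse, r2)  # 10.0 0.992
```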

5 Exploratory Data Analysis (EDA)

5.1 Feature Importance

To start our EDA, we must build our knowledge of the available features. One way to visualize the importance of the individual features is to use a random forest algorithm that supports plotting the importance. One such library is XGBoost:

xgb_model = xgb.XGBRegressor(n_jobs=1, booster="gbtree").fit(x_focus_train, y_train)
y_pred = xgb_model.predict(x_focus_test)

fig, ax = plt.subplots(1, 1, figsize=(10, 10))

xgb.plot_importance(xgb_model, title="Feature Importance", ax=ax)
brand_plot()
plt.show()

Figure 5: Feature Importance via XGBoost

Figure 5 illustrates the importance of taxAssessedValue and livingArea, which score the highest. Other floating point features also perform well, and most binary features score relatively low. Interestingly, beds scored much higher than baths. For now we will keep all of these features and perform analysis on the most important columns of data.

5.2 Column Exploration

In this section we will describe and query columns that are immediately relevant to our goal of predicting real estate listing prices.

5.2.1 hdpData_homeInfo_price

This column contains the target variable, price, in floating point format. The following code section explores this column:

sns.kdeplot(data=df, x="price", bw_adjust=0.2)
sns.rugplot(data=df, x="price")
plt.title("Real Estate Listing - Raw Price Distribution")
brand_plot()

Figure 6: hdpData_homeInfo_price column visualized in a KDE plot

In Figure 6 we can see that price is heavily concentrated in the bottom fifth of the range and that many outliers skew the distribution. In the code below we calculate statistical descriptions of this column:

df['price'].describe().apply(lambda x: format(x, 'f'))
count         658.000000
mean      1128054.920973
std        974527.892718
min        291000.000000
25%        646583.500000
50%        825000.000000
75%       1200000.000000
max      12995000.000000
Name: price, dtype: object

Now let's zoom in on the data and visualize the prices between the 25% and 75% quantiles:

q1, q3 = df['price'].quantile([0.25, 0.75])
price_zoomed_df = df[df['price'].between(q1, q3)]

sns.kdeplot(data=price_zoomed_df, x="price", bw_adjust=0.2)
sns.rugplot(data=price_zoomed_df, x="price")
plt.title(f"Real Estate Listing - {q1} - {q3} Price Distribution")
brand_plot()

Figure 7: Filtered price distribution of quantile 1 - quantile 3

In Figure 7 we visualize the central range of the pricing data. Here we can clearly see that the prices are not evenly distributed: density spikes near each $100,000 increment.
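This round-number clustering can be quantified. The sketch below, using hypothetical prices, counts how many listings fall within $10,000 of a $100,000 multiple:

```python
# Hypothetical toy prices; count listings within $10k of a $100k multiple
prices = [295_000, 399_000, 400_000, 405_000, 437_500,
          495_000, 500_000, 512_000, 548_000, 600_000]

def near_round_100k(price, tol=10_000):
    """True if price is within `tol` of the nearest $100,000 multiple."""
    remainder = price % 100_000
    return min(remainder, 100_000 - remainder) <= tol

near = [p for p in prices if near_round_100k(p)]
print(len(near), "of", len(prices), "listings near a round number")
```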

5.2.2 taxAssessedValue

This column contains the feature taxAssessedValue. This feature is highly correlated with price and, as seen in Section 5.1, may be the primary feature for our regression models:

sns.kdeplot(data=df, x="taxAssessedValue", bw_adjust=0.2)
sns.rugplot(data=df, x="taxAssessedValue")
plt.title("Real Estate Listing - Raw taxAssessedValue Distribution")
brand_plot()

Figure 8: taxAssessedValue feature visualized in a KDE plot

In Figure 8 we see a distribution similar to price with a high peak in the lower range and a long upper tail.

df['taxAssessedValue'].describe().apply(lambda x: format(x, 'f'))
count        658.000000
mean      648787.675380
std       550837.345364
min        14120.000000
25%       411342.500000
50%       518617.000000
75%       716900.000000
max      9299100.000000
Name: taxAssessedValue, dtype: object

Let’s visualize the central section of the density data:

q1, q3 = df['taxAssessedValue'].quantile([0.25, 0.75])
price_zoomed_df = df[df['taxAssessedValue'].between(q1, q3)]

sns.kdeplot(data=price_zoomed_df, x="taxAssessedValue", bw_adjust=0.2)
sns.rugplot(data=price_zoomed_df, x="taxAssessedValue")
plt.title(f"Real Estate Listing - {q1} - {q3}  taxAssessedValue Distribution")
brand_plot()

Figure 9: Filtered taxAssessedValue distribution of quantile 1 - quantile 3

In Figure 9 we see fewer peaks at the rounded numbers and a smoother distribution. Notice that the range of this chart is $400,000 to $700,000 vs. $600,000 to $1,200,000 for price.

5.2.3 livingArea

This column contains the feature livingArea. This feature is highly correlated with price and may be another primary feature for our regression models. The following code section explores this column:

sns.kdeplot(data=df, x="livingArea", bw_adjust=0.2)
sns.rugplot(data=df, x="livingArea")
plt.title("Real Estate Listing - Raw livingArea Distribution")
brand_plot()

Figure 10: livingArea feature visualized in a KDE plot

In Figure 10 we find that most listings are between 1,500 and 3,500 square feet. Just like price above, there are many outliers that skew the distribution.

df['livingArea'].describe().apply(lambda x: format(x, 'f'))
count     658.000000
mean     2969.103343
std      1494.572898
min       522.000000
25%      1896.500000
50%      2678.000000
75%      3725.750000
max      9971.000000
Name: livingArea, dtype: object

Now let's zoom in on the data and visualize the square footage between the 25% and 75% quantiles:

q1, q3 = df['livingArea'].quantile([0.25, 0.75])
price_zoomed_df = df[df['livingArea'].between(q1, q3)]

sns.kdeplot(data=price_zoomed_df, x="livingArea", bw_adjust=0.2)
sns.rugplot(data=price_zoomed_df, x="livingArea")
plt.title(f"Real Estate Listing - {q1} - {q3}  livingArea Distribution")
brand_plot()

Figure 11: Filtered livingArea distribution of quantile 1 - quantile 3

In Figure 11 we find no discernible patterns in the density. As expected, there appears to be an even distribution of values in the middle of the feature.

5.3 Clustering Visualization

For our unsupervised ML models our plan is to cluster the data, then perform a regression on the clustered data. But how do we choose our clusters? One option is to iterate over many possible cluster counts and analyze the quality of each clustering using the silhouette score. In the code section below we build a class, ClusterOptimizer, whose optimize_clusters method takes a DataFrame, two column names, and a clustering model. It iterates over a range of cluster counts, calculates the silhouette score for each, and keeps the highest-scoring result:

class ClusterOptimizer:
    """
    A class to optimize the number of clusters in a clustering model using silhouette scores.

    Attributes:
        max_all_silhouette_scores (float): The maximum silhouette score obtained across all iterations.
        max_all_silhouette_scores_pair (list): The pair of columns (x_col, y_col) that resulted in the maximum silhouette score.
        max_all_silhouette_scores_n_clusters (int): The number of clusters that resulted in the maximum silhouette score.
        max_all_model: The clustering model that achieved the maximum silhouette score.

    Methods:
        optimize_clusters: Optimizes the number of clusters in the model and plots the silhouette scores.
    """

    def __init__(self):
        self.clear_scores()

    def clear_scores(self):
        self.max_all_silhouette_scores = -1
        self.max_all_silhouette_scores_pair = None
        self.max_all_silhouette_scores_n_clusters = 0
        self.max_all_model = None

    def optimize_clusters(self, input_df, x_col, y_col, model, model_label, min_clusters=3, max_clusters=20):
        """
        Optimizes the number of clusters in the given model using silhouette scores.

        Args:
            input_df (DataFrame): The input DataFrame containing the data to cluster.
            x_col (str): The column name for the x-coordinate.
            y_col (str): The column name for the y-coordinate.
            model (sklearn.base.ClusterMixin): The clustering model to use.
            model_label (str): The label for the clustering model.
            min_clusters (int): The minimum number of clusters to consider (default: 3).
            max_clusters (int): The maximum number of clusters to consider (default: 20).
        """
        self.clear_scores()
        input_df = input_df.copy()
        lat = input_df[x_col].to_numpy()
        lng = input_df[y_col].to_numpy()

        coords = np.dstack((lat, lng))[0]

        silhouette_scores = []
        max_silhouette_score = -1
        max_silhouette_score_cluster_num = 0

        for i in range(min_clusters, max_clusters):
            cluster_model = model
            cluster_model.n_clusters = i
            cluster_model.fit(coords)
            this_score = silhouette_score(coords, cluster_model.labels_)
            silhouette_scores.append(this_score)

            if this_score > max_silhouette_score:
                max_silhouette_score = this_score
                max_silhouette_score_cluster_num = i

        plt.title("Coords Silhouette Score vs n_clusters")
        plt.plot(range(min_clusters, max_clusters), silhouette_scores)
        brand_plot()
        plt.show()

        # Update global maximum silhouette score and associated values
        if max_silhouette_score > self.max_all_silhouette_scores:
            self.max_all_silhouette_scores = max_silhouette_score
            self.max_all_silhouette_scores_pair = [x_col, y_col]
            self.max_all_silhouette_scores_n_clusters = max_silhouette_score_cluster_num
            self.max_all_model = model

        # Assign cluster labels to the input DataFrame
        best_cluster_model = model
        model.n_clusters = max_silhouette_score_cluster_num
        best_cluster_model.fit(coords)
        input_df["location_cluster"] = best_cluster_model.labels_

        # Plot the clustered data
        sns.scatterplot(data=input_df, x=x_col, y=y_col, hue='location_cluster', palette="tab10")
        plt.legend([], [], frameon=False)
        plt.title(f"{model_label} - Best Cluster {x_col} vs. {y_col}\nSS: {max_silhouette_score}, N_Clusters: {max_silhouette_score_cluster_num}")
        brand_plot()
        plt.show()

cluster_optimizer = ClusterOptimizer()
cluster_optimizer.optimize_clusters(df, "longitude", "latitude", KMeans(init="k-means++", n_init="auto", random_state=42), "KMeans")
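The silhouette score driving this optimization can be computed by hand for a toy one-dimensional clustering. For each point, \(s(i) = (b - a) / \max(a, b)\), where \(a\) is the mean distance to the other points in its own cluster and \(b\) is the mean distance to the points of the nearest other cluster; scores near 1 indicate tight, well-separated clusters. The clusters below are hypothetical:

```python
# Manual silhouette for one point of a toy 1-D clustering:
# s(i) = (b - a) / max(a, b)
clusters = {0: [1.0, 1.2], 1: [8.0, 8.4]}

def silhouette(point, own, others):
    """Silhouette score for a single point given its own cluster and
    a list of the other clusters (1-D, absolute-distance sketch)."""
    # a: mean distance to the rest of the point's own cluster
    a = sum(abs(point - q) for q in own if q != point) / (len(own) - 1)
    # b: mean distance to the nearest other cluster
    b = min(sum(abs(point - q) for q in grp) / len(grp) for grp in others)
    return (b - a) / max(a, b)

s = silhouette(1.0, clusters[0], [clusters[1]])
print(round(s, 3))  # 0.972 -- tight, well-separated clusters
```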

6 Machine Learning Models

Our ML models will share a general architecture for price prediction. We will use an unsupervised ML algorithm to create a subset of the data, and pass this subset to multiple supervised regression algorithms. The unsupervised ML models will use clustering (KMeans and AgglomerativeClustering) and principal component analysis (PCA) to subset the data. The supervised ML models will use the regression models LinearRegression, AdaBoostRegressor, and XGBRegressor. We will use both root mean squared error (RMSE) and \(R^2\) to capture the results from these models. Results for these models will be shared in Section 7.
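The cluster-then-regress architecture can be sketched in miniature. The sketch below uses hypothetical one-dimensional data and closed-form least squares in place of the scikit-learn models used in the sections that follow; the point is only the shape of the pipeline, with one regression model fit per cluster:

```python
# Sketch of the cluster-then-regress architecture on hypothetical data
def fit_line(xs, ys):
    """Closed-form simple linear regression: returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# Two hypothetical "neighborhoods" with different price-per-sqft slopes
data = {
    0: ([1000, 2000, 3000], [200_000, 400_000, 600_000]),      # $200/sqft
    1: ([1000, 2000, 3000], [500_000, 1_000_000, 1_500_000]),  # $500/sqft
}

# One regression model per cluster
models = {c: fit_line(xs, ys) for c, (xs, ys) in data.items()}

def predict(cluster, sqft):
    """Predict price using the model belonging to the point's cluster."""
    slope, intercept = models[cluster]
    return slope * sqft + intercept

print(predict(0, 1500), predict(1, 1500))  # 300000.0 750000.0
```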

6.1 Baseline Single Feature Linear Regression

Our first model will be a baseline comparison of vanilla linear regression. We are using the scikit-learn implementation and passing in the “most important” feature taxAssessedValue:

start = time.time()

linear_regression = LinearRegression()
linear_regression.fit(x_train["taxAssessedValue"].values.reshape(-1, 1), y_train)
y_pred = linear_regression.predict(x_test["taxAssessedValue"].values.reshape(-1, 1))

# calc_model_stats(f"LinearRegression", y_test, y_pred)
model_metrics.calculate("LinearRegression", y_test, y_pred, start)
execution_times["LinearRegression"] = {"duration": time.time() - start}

6.2 Principal Component Analysis (PCA)

PCA is a dimensionality reduction technique that encodes high-dimensional data into lower dimensional data while retaining the most important information. In the code section below we perform PCA for a range of values, calculating \(R^2\) at each step and capturing the best result. In comparison to the clustering algorithms we will see later in this section, this code is relatively simple. We save the best value of this result as PCA-{iteration}-{model}. In this section we use LinearRegression (LR), AdaBoostRegressor (AB) and XGBRegressor (XG) as our supervised regression models.
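The mechanics behind PCA can be seen on a toy example. The numpy sketch below (synthetic data, not the listings DataFrame) centers the data, eigendecomposes its covariance matrix, and projects onto the top component; because the two columns are nearly collinear, one component retains almost all of the variance:

```python
import numpy as np

# Minimal numpy sketch of what PCA does under the hood
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 1))
# Two strongly correlated columns (synthetic stand-in for collinear features)
data = np.hstack([x, 2 * x + rng.normal(scale=0.1, size=(100, 1))])

centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]        # sort descending by variance
explained = eigvals[order] / eigvals.sum()

# Project onto the first principal component only
reduced = centered @ eigvecs[:, order[:1]]
print(explained[0] > 0.99, reduced.shape)
```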


import xgboost as xgb
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import AdaBoostRegressor
from sklearn.metrics import r2_score

class PCAMaximizer:
    def __init__(self, x_train, y_train, x_test, y_test):
        """
        Initializes the PCAMaximizer class with training and testing data.

        Args:
            x_train: Training data features.
            y_train: Training data labels.
            x_test: Testing data features.
            y_test: Testing data labels.
        """
        self.x_train = x_train
        self.y_train = y_train
        self.x_test = x_test
        self.y_test = y_test

        self.best_rsquared = 0
        self.best_y_pred = None
        self.best_i = None
        self.best_model = None

    def find_best_model(self):
        """
        Finds the best model using PCA and different regression algorithms.

        Returns:
            The best model name.
        """
        for i in range(2, 30):
            # Perform PCA on training and testing data
            pca = PCA(n_components=i)
            x_train_reduced = pca.fit_transform(self.x_train)
            x_test_reduced = pca.transform(self.x_test)

            # Linear Regression
            linear_regression = LinearRegression()
            linear_regression.fit(x_train_reduced, self.y_train)
            y_pred = linear_regression.predict(x_test_reduced)
            calc_r2_score = r2_score(self.y_test, y_pred)

            if calc_r2_score > self.best_rsquared:
                self.best_rsquared = calc_r2_score
                self.best_y_pred = y_pred
                self.best_i = i
                self.best_model = "LR"

            # AdaBoost
            adaboost_model = AdaBoostRegressor(random_state=42, n_estimators=100).fit(
                x_train_reduced, self.y_train
            )
            y_pred = adaboost_model.predict(x_test_reduced)

            calc_r2_score = r2_score(self.y_test, y_pred)
            if calc_r2_score > self.best_rsquared:
                self.best_rsquared = calc_r2_score
                self.best_y_pred = y_pred
                self.best_i = i
                self.best_model = "AB"

            # XGBoost
            booster = "gbtree"
            xgb_model = xgb.XGBRegressor(n_jobs=1, booster=booster).fit(x_train_reduced, self.y_train)
            y_pred = xgb_model.predict(x_test_reduced)

            calc_r2_score = r2_score(self.y_test, y_pred)
            if calc_r2_score > self.best_rsquared:
                self.best_rsquared = calc_r2_score
                self.best_y_pred = y_pred
                self.best_i = i
                self.best_model = "XG"

        return self.best_model

    def get_best_y_pred(self):
        """
        Returns the best predicted values.

        Returns:
            The best predicted values.
        """
        return self.best_y_pred


start = time.time()
pca_maximizer = PCAMaximizer(x_train, y_train, x_test, y_test)
pca_maximizer.find_best_model()
# calc_model_stats(f"PCA {best_i}-{best_model}", y_test, best_y_pred)
model_metrics.calculate(f"PCA {pca_maximizer.best_i}-{pca_maximizer.best_model}", y_test, pca_maximizer.best_y_pred, start)
# execution_times[f"PCA {pca_maximizer.best_i}-{pca_maximizer.best_model}"] = {"duration": time.time() - start}

6.3 Feature Selection

To initiate the cluster modeling process, we start by generating an optimized list of feature subsets and saving it in the selected_features_list variable. Each entry is the best-performing subset of a given size, starting with two features and adding one feature per entry. We accomplish this by utilizing the scikit-learn library's SelectKBest module with the f_regression scoring function. Furthermore, we employ linear regression to calculate a model using each subset and store the best outcome as the SelectKBestLR model.

class FeatureSelection:
    """
    Use `SelectKBest` to find the best-performing feature subsets and
    predict on the testing labels
    """

    def __init__(self, x_train, y_train, x_test, y_test):
        """
        Initialize the class
        :param x_train: training data
        :param y_train: training labels
        :param x_test: testing data
        :param y_test: testing labels
        """
        self.x_train = x_train
        self.y_train = y_train
        self.x_test = x_test
        self.y_test = y_test
        self.best_features = None
        self.selected_features = None
        self.best_y_pred = None

    def select(self):
        """
        Select the best feature subset of each size, keeping the subset
        with the highest R^2 score
        :return: None
        """
        best_rsquared = 0
        best_features = None
        best_y_pred = None
        selected_features_list = []

        for i in range(2, len(self.x_train.columns) - 1):
            # Select the i best features by univariate F-test
            selector = SelectKBest(score_func=f_regression, k=i)
            X_reduced = selector.fit_transform(self.x_train, self.y_train)

            # Recover the column names of the selected features
            selected_features = pd.DataFrame(
                selector.inverse_transform(X_reduced),
                index=self.x_train.index,
                columns=self.x_train.columns,
            )
            selected_columns = selected_features.columns[selected_features.var() != 0]
            selected_features_list.append(list(selected_columns))

            # Fit a linear regression on the selected columns and predict
            linear_regression = LinearRegression()
            linear_regression.fit(self.x_train[selected_columns], self.y_train)
            y_pred = linear_regression.predict(self.x_test[selected_columns])

            # Keep this subset if its R^2 score beats the previous best
            calc_r2_score = r2_score(self.y_test, y_pred)
            if calc_r2_score > best_rsquared:
                best_rsquared = calc_r2_score
                best_features = selected_columns
                best_y_pred = y_pred

        self.best_features = best_features
        self.best_y_pred = best_y_pred
        self.selected_features = selected_features_list

start = time.time()
feature_selector = FeatureSelection(x_train, y_train, x_test, y_test)
feature_selector.select()
selected_features_list = feature_selector.selected_features
        
# calc_model_stats("SelectKBestLR", y_test, best_y_pred)
model_metrics.calculate(f"SelectKBestLR", y_test, feature_selector.best_y_pred, start)
execution_times["SelectKBestLR"] = {"duration": time.time() - start}
print(feature_selector.selected_features)
[['taxAssessedValue', 'taxableLandValue'], ['taxAssessedValue', 'taxableLandValue', 'taxableImprovementValue'], ['taxAssessedValue', 'totalSqFt', 'taxableLandValue', 'taxableImprovementValue'], ['livingArea', 'taxAssessedValue', 'totalSqFt', 'taxableLandValue', 'taxableImprovementValue'], ['baths', 'livingArea', 'taxAssessedValue', 'totalSqFt', 'taxableLandValue', 'taxableImprovementValue'], ['baths', 'livingArea', 'taxAssessedValue', 'totalSqFt', 'taxableLandValue', 'taxableImprovementValue', 'heatingTypeCode_Radiant'], ['baths', 'livingArea', 'taxAssessedValue', 'totalSqFt', 'taxableLandValue', 'taxableImprovementValue', 'heatingTypeCode_Radiant', 'buildingQualityCode_Excellent'], ['baths', 'livingArea', 'taxAssessedValue', 'totalSqFt', 'taxableLandValue', 'taxableImprovementValue', 'garageParkingSquareFeet', 'heatingTypeCode_Radiant', 'buildingQualityCode_Excellent'], ['baths', 'livingArea', 'taxAssessedValue', 'totalSqFt', 'taxableLandValue', 'taxableImprovementValue', 'groundFloorSquareFeet', 'garageParkingSquareFeet', 'heatingTypeCode_Radiant', 'buildingQualityCode_Excellent'], ['baths', 'longitude', 'livingArea', 'taxAssessedValue', 'totalSqFt', 'taxableLandValue', 'taxableImprovementValue', 'groundFloorSquareFeet', 'garageParkingSquareFeet', 'heatingTypeCode_Radiant', 'buildingQualityCode_Excellent'], ['baths', 'longitude', 'livingArea', 'taxAssessedValue', 'totalSqFt', 'lotSqFt', 'taxableLandValue', 'taxableImprovementValue', 'groundFloorSquareFeet', 'garageParkingSquareFeet', 'heatingTypeCode_Radiant', 'buildingQualityCode_Excellent'], ['beds', 'baths', 'longitude', 'livingArea', 'taxAssessedValue', 'totalSqFt', 'lotSqFt', 'taxableLandValue', 'taxableImprovementValue', 'groundFloorSquareFeet', 'garageParkingSquareFeet', 'heatingTypeCode_Radiant', 'buildingQualityCode_Excellent'], ['beds', 'baths', 'longitude', 'livingArea', 'taxAssessedValue', 'totalSqFt', 'lotSqFt', 'taxableLandValue', 'taxableImprovementValue', 'groundFloorSquareFeet', 
'garageParkingSquareFeet', 'heatingTypeCode_Radiant', 'buildingQualityCode_Average', 'buildingQualityCode_Excellent'], ['beds', 'baths', 'longitude', 'livingArea', 'taxAssessedValue', 'schoolsRating', 'totalSqFt', 'lotSqFt', 'taxableLandValue', 'taxableImprovementValue', 'groundFloorSquareFeet', 'garageParkingSquareFeet', 'heatingTypeCode_Radiant', 'buildingQualityCode_Average', 'buildingQualityCode_Excellent'], ['beds', 'baths', 'longitude', 'livingArea', 'taxAssessedValue', 'schoolsRating', 'totalSqFt', 'lotSqFt', 'taxableLandValue', 'taxableImprovementValue', 'groundFloorSquareFeet', 'garageParkingSquareFeet', 'heatingTypeCode_Radiant', 'buildingQualityCode_Average', 'buildingQualityCode_Excellent', 'garageCode_Basement'], ['beds', 'baths', 'longitude', 'livingArea', 'taxAssessedValue', 'schoolsRating', 'totalSqFt', 'lotSqFt', 'taxableLandValue', 'taxableImprovementValue', 'groundFloorSquareFeet', 'garageParkingSquareFeet', 'heatingTypeCode_Radiant', 'buildingQualityCode_Average', 'buildingQualityCode_Excellent', 'buildingQualityCode_Good', 'garageCode_Basement'], ['beds', 'baths', 'longitude', 'livingArea', 'taxAssessedValue', 'schoolsRating', 'yearRenovated', 'totalSqFt', 'lotSqFt', 'taxableLandValue', 'taxableImprovementValue', 'groundFloorSquareFeet', 'garageParkingSquareFeet', 'heatingTypeCode_Radiant', 'buildingQualityCode_Average', 'buildingQualityCode_Excellent', 'buildingQualityCode_Good', 'garageCode_Basement'], ['beds', 'baths', 'latitude', 'longitude', 'livingArea', 'taxAssessedValue', 'schoolsRating', 'yearRenovated', 'totalSqFt', 'lotSqFt', 'taxableLandValue', 'taxableImprovementValue', 'groundFloorSquareFeet', 'garageParkingSquareFeet', 'heatingTypeCode_Radiant', 'buildingQualityCode_Average', 'buildingQualityCode_Excellent', 'buildingQualityCode_Good', 'garageCode_Basement'], ['beds', 'baths', 'latitude', 'longitude', 'livingArea', 'taxAssessedValue', 'schoolsRating', 'yearRenovated', 'totalSqFt', 'lotSqFt', 'taxableLandValue', 
'taxableImprovementValue', 'groundFloorSquareFeet', 'garageParkingSquareFeet', 'heatingTypeCode_Hot Water', 'heatingTypeCode_Radiant', 'buildingQualityCode_Average', 'buildingQualityCode_Excellent', 'buildingQualityCode_Good', 'garageCode_Basement'], ['beds', 'baths', 'latitude', 'longitude', 'livingArea', 'taxAssessedValue', 'schoolsRating', 'yearRenovated', 'totalSqFt', 'lotSqFt', 'taxableLandValue', 'taxableImprovementValue', 'groundFloorSquareFeet', 'garageParkingSquareFeet', 'heatingTypeCode_Hot Water', 'heatingTypeCode_Radiant', 'buildingQualityCode_Average', 'buildingQualityCode_Excellent', 'buildingQualityCode_Fair', 'buildingQualityCode_Good', 'garageCode_Basement'], ['beds', 'baths', 'latitude', 'longitude', 'livingArea', 'taxAssessedValue', 'schoolsRating', 'yearBuilt', 'yearRenovated', 'totalSqFt', 'lotSqFt', 'taxableLandValue', 'taxableImprovementValue', 'groundFloorSquareFeet', 'garageParkingSquareFeet', 'heatingTypeCode_Hot Water', 'heatingTypeCode_Radiant', 'buildingQualityCode_Average', 'buildingQualityCode_Excellent', 'buildingQualityCode_Fair', 'buildingQualityCode_Good', 'garageCode_Basement'], ['beds', 'baths', 'latitude', 'longitude', 'livingArea', 'taxAssessedValue', 'schoolsRating', 'yearBuilt', 'yearRenovated', 'totalSqFt', 'lotSqFt', 'taxableLandValue', 'taxableImprovementValue', 'groundFloorSquareFeet', 'garageParkingSquareFeet', 'heatingTypeCode_Forced Air', 'heatingTypeCode_Hot Water', 'heatingTypeCode_Radiant', 'buildingQualityCode_Average', 'buildingQualityCode_Excellent', 'buildingQualityCode_Fair', 'buildingQualityCode_Good', 'garageCode_Basement'], ['beds', 'baths', 'latitude', 'longitude', 'livingArea', 'taxAssessedValue', 'schoolsRating', 'yearBuilt', 'yearRenovated', 'totalSqFt', 'lotSqFt', 'taxableLandValue', 'taxableImprovementValue', 'groundFloorSquareFeet', 'garageParkingSquareFeet', 'heatingTypeCode_Forced Air', 'heatingTypeCode_Gravity', 'heatingTypeCode_Hot Water', 'heatingTypeCode_Radiant', 'buildingQualityCode_Average', 
'buildingQualityCode_Excellent', 'buildingQualityCode_Fair', 'buildingQualityCode_Good', 'garageCode_Basement'], ['beds', 'baths', 'latitude', 'longitude', 'livingArea', 'taxAssessedValue', 'schoolsRating', 'yearBuilt', 'yearRenovated', 'totalSqFt', 'lotSqFt', 'taxableLandValue', 'taxableImprovementValue', 'groundFloorSquareFeet', 'garageParkingSquareFeet', 'heatingTypeCode_Forced Air', 'heatingTypeCode_Gravity', 'heatingTypeCode_Hot Water', 'heatingTypeCode_Radiant', 'buildingQualityCode_Average', 'buildingQualityCode_Excellent', 'buildingQualityCode_Fair', 'buildingQualityCode_Good', 'garageCode_Basement', 'garageCode_Carport'], ['beds', 'baths', 'latitude', 'longitude', 'livingArea', 'taxAssessedValue', 'schoolsRating', 'yearBuilt', 'yearRenovated', 'totalSqFt', 'lotSqFt', 'taxableLandValue', 'taxableImprovementValue', 'groundFloorSquareFeet', 'garageParkingSquareFeet', 'heatingTypeCode_Forced Air', 'heatingTypeCode_Gravity', 'heatingTypeCode_Hot Water', 'heatingTypeCode_Radiant', 'buildingQualityCode_Average', 'buildingQualityCode_Excellent', 'buildingQualityCode_Fair', 'buildingQualityCode_Good', 'garageCode_Basement', 'garageCode_Carport', 'garageCode_Detached Frame'], ['beds', 'baths', 'latitude', 'longitude', 'livingArea', 'taxAssessedValue', 'schoolsRating', 'yearBuilt', 'yearRenovated', 'totalSqFt', 'lotSqFt', 'taxableLandValue', 'taxableImprovementValue', 'groundFloorSquareFeet', 'garageParkingSquareFeet', 'heatingTypeCode_Forced Air', 'heatingTypeCode_Gravity', 'heatingTypeCode_Hot Water', 'heatingTypeCode_Radiant', 'heatingTypeCode_Warm Air', 'buildingQualityCode_Average', 'buildingQualityCode_Excellent', 'buildingQualityCode_Fair', 'buildingQualityCode_Good', 'garageCode_Basement', 'garageCode_Carport', 'garageCode_Detached Frame'], ['beds', 'baths', 'latitude', 'longitude', 'livingArea', 'taxAssessedValue', 'schoolsRating', 'yearBuilt', 'yearRenovated', 'totalSqFt', 'lotSqFt', 'taxableLandValue', 'taxableImprovementValue', 'groundFloorSquareFeet', 
'garageParkingSquareFeet', 'heatingTypeCode_Forced Air', 'heatingTypeCode_Gravity', 'heatingTypeCode_Hot Water', 'heatingTypeCode_Radiant', 'heatingTypeCode_Wall Furnace', 'heatingTypeCode_Warm Air', 'buildingQualityCode_Average', 'buildingQualityCode_Excellent', 'buildingQualityCode_Fair', 'buildingQualityCode_Good', 'garageCode_Basement', 'garageCode_Carport', 'garageCode_Detached Frame'], ['beds', 'baths', 'latitude', 'longitude', 'livingArea', 'taxAssessedValue', 'schoolsRating', 'yearBuilt', 'yearRenovated', 'totalSqFt', 'lotSqFt', 'taxableLandValue', 'taxableImprovementValue', 'groundFloorSquareFeet', 'garageParkingSquareFeet', 'heatingTypeCode_Forced Air', 'heatingTypeCode_Gravity', 'heatingTypeCode_Hot Water', 'heatingTypeCode_Radiant', 'heatingTypeCode_Wall Furnace', 'heatingTypeCode_Warm Air', 'buildingQualityCode_Average', 'buildingQualityCode_Excellent', 'buildingQualityCode_Fair', 'buildingQualityCode_Good', 'buildingQualityCode_Poor', 'garageCode_Basement', 'garageCode_Carport', 'garageCode_Detached Frame'], ['beds', 'baths', 'latitude', 'longitude', 'livingArea', 'taxAssessedValue', 'schoolsRating', 'yearBuilt', 'yearRenovated', 'totalSqFt', 'lotSqFt', 'taxableLandValue', 'taxableImprovementValue', 'groundFloorSquareFeet', 'garageParkingSquareFeet', 'heatingTypeCode_Forced Air', 'heatingTypeCode_Forced Air Gas', 'heatingTypeCode_Gravity', 'heatingTypeCode_Hot Water', 'heatingTypeCode_Radiant', 'heatingTypeCode_Wall Furnace', 'heatingTypeCode_Warm Air', 'buildingQualityCode_Average', 'buildingQualityCode_Excellent', 'buildingQualityCode_Fair', 'buildingQualityCode_Good', 'buildingQualityCode_Poor', 'garageCode_Basement', 'garageCode_Carport', 'garageCode_Detached Frame'], ['beds', 'baths', 'latitude', 'longitude', 'livingArea', 'taxAssessedValue', 'schoolsRating', 'yearBuilt', 'yearRenovated', 'totalSqFt', 'lotSqFt', 'taxableLandValue', 'taxableImprovementValue', 'groundFloorSquareFeet', 'garageParkingSquareFeet', 'heatingTypeCode_Baseboard Hot 
Water', 'heatingTypeCode_Forced Air', 'heatingTypeCode_Forced Air Gas', 'heatingTypeCode_Gravity', 'heatingTypeCode_Hot Water', 'heatingTypeCode_Radiant', 'heatingTypeCode_Wall Furnace', 'heatingTypeCode_Warm Air', 'buildingQualityCode_Average', 'buildingQualityCode_Excellent', 'buildingQualityCode_Fair', 'buildingQualityCode_Good', 'buildingQualityCode_Poor', 'garageCode_Basement', 'garageCode_Carport', 'garageCode_Detached Frame'], ['beds', 'baths', 'latitude', 'longitude', 'livingArea', 'taxAssessedValue', 'schoolsRating', 'yearBuilt', 'yearRenovated', 'totalSqFt', 'lotSqFt', 'taxableLandValue', 'taxableImprovementValue', 'groundFloorSquareFeet', 'garageParkingSquareFeet', 'heatingTypeCode_Baseboard Hot Water', 'heatingTypeCode_Forced Air', 'heatingTypeCode_Forced Air Gas', 'heatingTypeCode_Furnace', 'heatingTypeCode_Gravity', 'heatingTypeCode_Hot Water', 'heatingTypeCode_Radiant', 'heatingTypeCode_Wall Furnace', 'heatingTypeCode_Warm Air', 'buildingQualityCode_Average', 'buildingQualityCode_Excellent', 'buildingQualityCode_Fair', 'buildingQualityCode_Good', 'buildingQualityCode_Poor', 'garageCode_Basement', 'garageCode_Carport', 'garageCode_Detached Frame'], ['beds', 'baths', 'latitude', 'longitude', 'livingArea', 'taxAssessedValue', 'schoolsRating', 'yearBuilt', 'yearRenovated', 'totalSqFt', 'lotSqFt', 'taxableLandValue', 'taxableImprovementValue', 'groundFloorSquareFeet', 'garageParkingSquareFeet', 'heatingTypeCode_Baseboard Hot Water', 'heatingTypeCode_Forced Air', 'heatingTypeCode_Forced Air Gas', 'heatingTypeCode_Furnace', 'heatingTypeCode_Gravity', 'heatingTypeCode_Hot Water', 'heatingTypeCode_Radiant', 'heatingTypeCode_Wall Furnace', 'heatingTypeCode_Warm Air', 'buildingQualityCode_Average', 'buildingQualityCode_Excellent', 'buildingQualityCode_Fair', 'buildingQualityCode_Good', 'buildingQualityCode_Poor', 'garageCode_Basement', 'garageCode_Carport', 'garageCode_Detached Frame', 'garageCode_Detached Masonry'], ['beds', 'baths', 'latitude', 'longitude', 
'livingArea', 'taxAssessedValue', 'schoolsRating', 'yearBuilt', 'yearRenovated', 'totalSqFt', 'lotSqFt', 'taxableLandValue', 'taxableImprovementValue', 'groundFloorSquareFeet', 'garageParkingSquareFeet', 'heatingTypeCode_Baseboard Electric', 'heatingTypeCode_Baseboard Hot Water', 'heatingTypeCode_Forced Air', 'heatingTypeCode_Forced Air Gas', 'heatingTypeCode_Furnace', 'heatingTypeCode_Gravity', 'heatingTypeCode_Hot Water', 'heatingTypeCode_Radiant', 'heatingTypeCode_Wall Furnace', 'heatingTypeCode_Warm Air', 'buildingQualityCode_Average', 'buildingQualityCode_Excellent', 'buildingQualityCode_Fair', 'buildingQualityCode_Good', 'buildingQualityCode_Poor', 'garageCode_Basement', 'garageCode_Carport', 'garageCode_Detached Frame', 'garageCode_Detached Masonry'], ['beds', 'baths', 'latitude', 'longitude', 'livingArea', 'taxAssessedValue', 'schoolsRating', 'yearBuilt', 'yearRenovated', 'totalSqFt', 'lotSqFt', 'taxableLandValue', 'taxableImprovementValue', 'groundFloorSquareFeet', 'garageParkingSquareFeet', 'heatingTypeCode_Baseboard Electric', 'heatingTypeCode_Baseboard Electric/Hot Water', 'heatingTypeCode_Baseboard Hot Water', 'heatingTypeCode_Forced Air', 'heatingTypeCode_Forced Air Gas', 'heatingTypeCode_Furnace', 'heatingTypeCode_Gravity', 'heatingTypeCode_Hot Water', 'heatingTypeCode_Radiant', 'heatingTypeCode_Wall Furnace', 'heatingTypeCode_Warm Air', 'buildingQualityCode_Average', 'buildingQualityCode_Excellent', 'buildingQualityCode_Fair', 'buildingQualityCode_Good', 'buildingQualityCode_Poor', 'garageCode_Basement', 'garageCode_Carport', 'garageCode_Detached Frame', 'garageCode_Detached Masonry'], ['beds', 'baths', 'latitude', 'longitude', 'livingArea', 'taxAssessedValue', 'schoolsRating', 'yearBuilt', 'yearRenovated', 'totalSqFt', 'lotSqFt', 'taxableLandValue', 'taxableImprovementValue', 'groundFloorSquareFeet', 'garageParkingSquareFeet', 'heatingTypeCode_Baseboard Electric', 'heatingTypeCode_Baseboard Electric/Hot Water', 'heatingTypeCode_Baseboard Hot 
Water', 'heatingTypeCode_Forced Air', 'heatingTypeCode_Forced Air Gas', 'heatingTypeCode_Furnace', 'heatingTypeCode_Gravity', 'heatingTypeCode_Hot Water', 'heatingTypeCode_Radiant', 'heatingTypeCode_Wall Furnace', 'heatingTypeCode_Warm Air', 'buildingQualityCode_Average', 'buildingQualityCode_Excellent', 'buildingQualityCode_Fair', 'buildingQualityCode_Good', 'buildingQualityCode_Poor', 'garageCode_Basement', 'garageCode_Built Under', 'garageCode_Carport', 'garageCode_Detached Frame', 'garageCode_Detached Masonry'], ['beds', 'baths', 'latitude', 'longitude', 'livingArea', 'taxAssessedValue', 'schoolsRating', 'yearBuilt', 'yearRenovated', 'totalSqFt', 'lotSqFt', 'taxableLandValue', 'taxableImprovementValue', 'groundFloorSquareFeet', 'garageParkingSquareFeet', 'heatingTypeCode_Baseboard Electric', 'heatingTypeCode_Baseboard Electric/Hot Water', 'heatingTypeCode_Baseboard Hot Water', 'heatingTypeCode_Floor/wall Furnace', 'heatingTypeCode_Forced Air', 'heatingTypeCode_Forced Air Gas', 'heatingTypeCode_Furnace', 'heatingTypeCode_Gravity', 'heatingTypeCode_Hot Water', 'heatingTypeCode_Radiant', 'heatingTypeCode_Wall Furnace', 'heatingTypeCode_Warm Air', 'buildingQualityCode_Average', 'buildingQualityCode_Excellent', 'buildingQualityCode_Fair', 'buildingQualityCode_Good', 'buildingQualityCode_Poor', 'garageCode_Basement', 'garageCode_Built Under', 'garageCode_Carport', 'garageCode_Detached Frame', 'garageCode_Detached Masonry']]

6.4 Cluster Selection

This section focuses on finding the highest scoring feature columns for a given number of clusters. We use silhouette_score as the metric for evaluating cluster quality. The ClusterFeatureSelector class builds combinations of column names, iterates over these combinations and over the number of clusters, and calculates the silhouette_score for each. We save the best cluster score and feature columns in best_cluster_features and return them from build_by_dimension. It is recommended to keep the dimension argument low (2 or 3) to avoid long run times.
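
As a minimal illustration of the metric (a sketch on toy data, not this project's dataset): silhouette_score ranges from -1 to 1, with values near 1 indicating compact, well separated clusters.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Two well separated toy blobs: the silhouette score should be near 1
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.1, (50, 2)), rng.normal(5, 0.1, (50, 2))])

model = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=42).fit(X)
print(round(silhouette_score(X, model.labels_), 3))  # close to 1.0
```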

MIN_CLUSTERS = 2
MAX_CLUSTERS = 5


class ClusterFeatureSelector:
    def __init__(
        self,
        input_df,
        input_cols,
        model,
        min_clusters=MIN_CLUSTERS,
        max_clusters=MAX_CLUSTERS,
    ):
        cluster_cols = input_cols.copy()

        cluster_cols.remove("price")
        cluster_cols.remove("zestimate")
        cluster_cols.remove("beds")
        cluster_cols.remove("baths")

        self.cluster_features = cluster_cols
        self.input_df = input_df
        self.model = model

        self.min_clusters = min_clusters
        self.max_clusters = max_clusters

    def build_by_dimension(self, dimension):
        best_cluster_features = {}

        feature_combinations = list(
            itertools.combinations(self.cluster_features, dimension)
        )

        for feature_combo in feature_combinations:
            input_df = self.input_df.copy()
            # Stack the selected feature columns into an (n_samples, dimension) matrix
            cluster_input_list = [input_df[i].to_numpy() for i in feature_combo]

            cluster_input = np.dstack(cluster_input_list)[0]

            for i in range(self.min_clusters, self.max_clusters + 1):
                cluster_model = copy.copy(self.model)
                cluster_model.n_clusters = i
                cluster_model.fit(cluster_input)
                this_score = silhouette_score(cluster_input, cluster_model.labels_)

                if i not in best_cluster_features:
                    best_cluster_features[i] = {"sil_score": 0, "features": None}

                if this_score > best_cluster_features[i]["sil_score"]:
                    best_cluster_features[i]["sil_score"] = this_score
                    best_cluster_features[i]["features"] = feature_combo

        return best_cluster_features

start = time.time()
clusterFeatureSelector = ClusterFeatureSelector(df, float_cols, KMeans(init="k-means++", n_init=20, random_state=42))
two_d_cluster_features = clusterFeatureSelector.build_by_dimension(2)
execution_times["TwoDClusterSelection"] = {"duration": time.time() - start}

# three_d_cluster_features = clusterFeatureSelector.build_by_dimension(3)
# print(three_d_cluster_features)
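
A note on the feature matrix construction above: for a list of k one-dimensional feature arrays of length n, np.dstack produces an array of shape (1, n, k), so indexing with [0] yields the (n, k) matrix the clustering model expects. A minimal sketch (the arrays below are made up for illustration):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])     # one feature column
b = np.array([10.0, 20.0, 30.0])  # another feature column

stacked = np.dstack([a, b])[0]    # shape (3, 2): one row per sample
print(stacked.shape)              # (3, 2)
```

This is equivalent to np.column_stack([a, b]).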

6.5 Regression Using Clusters

The ClusterRegression class breaks the dataset into smaller clusters and runs a regression algorithm on each cluster. In theory this is a simple concept, but it introduces many variables in the form of feature selection, cluster selection, and the number of clusters. To manage this we pre-calculated the best features and clusters in the previous two sections, and we use them here to find the optimal cluster configuration.

class ClusterRegression:
    def __init__(
        self,
        train,
        test,
        target,
        cluster_model,
        regression_model,
        min_clusters=MIN_CLUSTERS,
        max_clusters=MAX_CLUSTERS,
    ):
        self.train = train
        self.test = test
        self.target = target
        self.cluster_model = cluster_model
        self.regression_model = regression_model
        self.min_clusters = min_clusters
        self.max_clusters = max_clusters
        self.best_rsquared = 0
        self.best_regression_features = None
        self.best_cluster_features = None
        self.best_n_clusters = None
        self.best_pred = None

    def find_optimal_cluster(self, regression_features, cluster_features):
        train_df = self.train.copy()
        test_df = self.test.copy()

        for regression_feat_cols in regression_features:
            for cluster_num in cluster_features.keys():
                this_regression_features = regression_feat_cols
                this_cluster_features = cluster_features[cluster_num]["features"]
                n_clusters = cluster_num

                train_cluster_input_list = [train_df[i].to_numpy() for i in this_cluster_features]
                test_cluster_input_list = [test_df[i].to_numpy() for i in this_cluster_features]

                train_cluster_input = np.dstack(train_cluster_input_list)[0]
                test_cluster_input = np.dstack(test_cluster_input_list)[0]
                this_train_df = train_df.copy()
                this_test_df = test_df.copy()

                cluster_model = copy.copy(self.cluster_model)
                cluster_model.n_clusters = cluster_num
                cluster_model.fit(train_cluster_input)
                this_train_df["cluster_label"] = cluster_model.labels_
                train_labels = cluster_model.labels_
                # Use predict (not fit_predict) so test rows are assigned to the
                # clusters learned from the training data
                test_labels = cluster_model.predict(test_cluster_input)
                this_test_df["cluster_label"] = test_labels

                # Build train and test dataframes for each cluster
                train_clusters = {label: pd.DataFrame() for label in train_labels}
                for key in train_clusters.keys():
                    train_clusters[key] = this_train_df[this_train_df['cluster_label'] == key]

                test_clusters = {label: pd.DataFrame() for label in test_labels}
                for key in test_clusters.keys():
                    test_clusters[key] = this_test_df[this_test_df['cluster_label'] == key]

                test_cluster_df_list = []
                for key in train_clusters.keys():
                    train_cluster_df = train_clusters[key]
                    # A training cluster may contain no test rows; skip it if so
                    if key not in test_clusters:
                        continue
                    test_cluster_df = test_clusters[key].copy()
                    # Fit a fresh copy of the regression model on this cluster only
                    regression_model = copy.copy(self.regression_model)
                    regression_model.fit(train_cluster_df[this_regression_features], train_cluster_df[self.target])
                    cluster_y_pred = regression_model.predict(test_cluster_df[this_regression_features])
                    test_cluster_df["y_pred"] = cluster_y_pred
                    test_cluster_df_list.append(test_cluster_df)

                pred_df = pd.concat(test_cluster_df_list)
                pred_df = pred_df.sort_index()

                test_actual = self.test[self.target].sort_index()
                rsquared = r2_score(test_actual, pred_df['y_pred'])

                if rsquared > self.best_rsquared:
                    self.best_rsquared = rsquared
                    self.best_regression_features = this_regression_features
                    self.best_cluster_features = this_cluster_features
                    self.best_n_clusters = cluster_num
                    self.best_pred = pred_df['y_pred']

        print(f"Best:\n\tR2: {self.best_rsquared}\n\tRegression: {self.best_regression_features}\n\tCluster: {self.best_cluster_features}\n\tn_clusters: {self.best_n_clusters}")
        print()

    def get_best_rsquared(self):
        return self.best_rsquared

    def get_best_pred(self):
        return self.best_pred

    def get_best_features(self):
        return self.best_regression_features

    def get_best_clusters(self):
        return self.best_n_clusters

start = time.time()
linear_cluster_regressor = ClusterRegression(all_train, all_test, target, KMeans(init="k-means++", n_init="auto", random_state=42), LinearRegression())
linear_cluster_regressor.find_optimal_cluster(selected_features_list, two_d_cluster_features)
y_pred_lcr = linear_cluster_regressor.get_best_pred()

# calc_model_stats("OptClusterLinReg", y_test.sort_index(), y_pred_lcr)
model_metrics.calculate("OptClusterLinReg", y_test.sort_index(), y_pred_lcr, start)
execution_times["OptClusterLinReg"] = {"duration": time.time() - start}
Best:
    R2: 0.6979821632696949
    Regression: ['beds', 'baths', 'latitude', 'longitude', 'livingArea', 'taxAssessedValue', 'schoolsRating', 'yearRenovated', 'totalSqFt', 'lotSqFt', 'taxableLandValue', 'taxableImprovementValue', 'groundFloorSquareFeet', 'garageParkingSquareFeet', 'heatingTypeCode_Hot Water', 'heatingTypeCode_Radiant', 'buildingQualityCode_Average', 'buildingQualityCode_Excellent', 'buildingQualityCode_Fair', 'buildingQualityCode_Good', 'garageCode_Basement']
    Cluster: ('longitude', 'lotSqFt')
    n_clusters: 3
start = time.time()
xgb_cluster_regressor = ClusterRegression(all_train, all_test, target, KMeans(init="k-means++", n_init="auto", random_state=42), xgb.XGBRegressor(n_jobs=1, booster="gbtree"))
xgb_cluster_regressor.find_optimal_cluster(selected_features_list, two_d_cluster_features)
y_pred_xgbcr = xgb_cluster_regressor.get_best_pred()

# calc_model_stats("OptClusterXGB", y_test.sort_index(), y_pred_xgbcr)
model_metrics.calculate("OptClusterXGB", y_test.sort_index(), y_pred_xgbcr, start)
execution_times["OptClusterXGB"] = {"duration": time.time() - start}
Best:
    R2: 0.7275406553596411
    Regression: ['baths', 'longitude', 'livingArea', 'taxAssessedValue', 'totalSqFt', 'taxableLandValue', 'taxableImprovementValue', 'groundFloorSquareFeet', 'garageParkingSquareFeet', 'heatingTypeCode_Radiant', 'buildingQualityCode_Excellent']
    Cluster: ('longitude', 'taxableLandValue')
    n_clusters: 2

6.6 Neural Network

For a complete comparison of modeling techniques we add a naive neural network model to see if it offers additional accuracy when compared to the supervised and unsupervised ML algorithms. This is by no means an optimized or tuned neural network.

start = time.time()

input_shape = x_train.shape[1]

# Scale the input data
scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)

# Define the neural network architecture
model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation='relu', input_shape=(input_shape,)),
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1)
])

# Calculate decay steps based on the number of epochs and batch size
decay_steps = len(x_train) // 32 * 50

# Define learning rate schedule
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.01,
    # decay_steps=1000,
    decay_steps=decay_steps,
    decay_rate=0.9
)
# optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule)
optimizer = tf.keras.optimizers.Adam()

# Compile the model
model.compile(optimizer=optimizer, loss='mean_squared_error')

# Train the model with early stopping and reduced learning rate
early_stopping = tf.keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True)
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(factor=0.1, patience=3)

model.fit(x_train_scaled, y_train, epochs=50, batch_size=32, verbose="auto",
          validation_data=(x_test_scaled, y_test),
          callbacks=[early_stopping, reduce_lr, tf.keras.callbacks.LearningRateScheduler(lr_schedule)])
          # callbacks=[early_stopping, reduce_lr])

# Make predictions on x_test
y_pred = model.predict(x_test_scaled)

# calc_model_stats("NeuralNetwork", y_test, y_pred)
model_metrics.calculate("NeuralNetwork", y_test, y_pred, start)
execution_times["NeuralNetwork"] = {"duration": time.time() - start}
2023-12-31 13:20:12.146877: I metal_plugin/src/device/metal_device.cc:1154] Metal device set to: Apple M2 Max
2023-12-31 13:20:12.146893: I metal_plugin/src/device/metal_device.cc:296] systemMemory: 32.00 GB
2023-12-31 13:20:12.146901: I metal_plugin/src/device/metal_device.cc:313] maxCacheSize: 10.67 GB
Epoch 1/50 - 17/17 - 1s 15ms/step - loss: 1976414765056.0000 - val_loss: 3059149176832.0000 - lr: 0.0100
Epoch 2/50 - 17/17 - 0s 7ms/step - loss: 1837928284160.0000 - val_loss: 2132133937152.0000 - lr: 0.0100
Epoch 3/50 - 17/17 - 0s 6ms/step - loss: 2378015965184.0000 - val_loss: 2000834265088.0000 - lr: 0.0100
Epoch 4/50 - 17/17 - 0s 6ms/step - loss: 1898876239872.0000 - val_loss: 2856299528192.0000 - lr: 0.0100
Epoch 5/50 - 17/17 - 0s 6ms/step - loss: 3904141524992.0000 - val_loss: 22992841080832.0000 - lr: 0.0100
Epoch 6/50 - 17/17 - 0s 6ms/step - loss: 9197906296832.0000 - val_loss: 10479675113472.0000 - lr: 9.9934e-04
Epoch 7/50 - 17/17 - 0s 6ms/step - loss: 8856421793792.0000 - val_loss: 2820969857024.0000 - lr: 0.0100
Epoch 8/50 - 17/17 - 0s 6ms/step - loss: 4270040547328.0000 - val_loss: 14229082996736.0000 - lr: 0.0100
5/5 - 0s 2ms/step

7 Results & Analysis

7.1 Model Accuracy Results

Figure 12 compares the results of the model accuracy computations in a bar plot:

model_metrics.plot_all()

Figure 12: Final Model Accuracy Comparisons, RMSE and \(R^2\)

7.2 Model Accuracy Analysis

In Figure 12 we compare the \(R^2\) and root mean squared error (RMSE) evaluation metrics for all models. These metrics are well suited to quantifying error in regression algorithms and are used as such in this project. Surprisingly, our clustering regression model (OptClusterLinReg) performed the best when compared to the other models. The next best performing model was PCA-2-XG, which performed principal component analysis and then ran XGBRegressor on the result. Our neural network and linear regression models performed similarly.
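
As a sketch of how these two metrics are computed with scikit-learn (the prices below are hypothetical, not values from this project's dataset):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical actual and predicted listing prices
y_true = np.array([300_000, 450_000, 500_000, 650_000])
y_pred = np.array([320_000, 440_000, 520_000, 600_000])

rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # same units as price
r2 = r2_score(y_true, y_pred)                       # 1.0 is a perfect fit
print(f"RMSE: {rmse:.0f}, R^2: {r2:.4f}")           # RMSE: 29155, R^2: 0.9456
```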

7.3 Execution Time Results

exc_times_df = pd.DataFrame(execution_times).T
exc_times_df = exc_times_df.sort_values("duration", ascending=False)
exc_times_df["model"] = exc_times_df.index

fig, ax = plt.subplots(figsize=(12, 8))

sns.barplot(data=exc_times_df, x="model", y="duration", ax=ax, color="steelblue")
plt.title("Model Execution Time Comparison")
plt.ylabel("Duration (seconds)")
plt.xlabel("Model")
plt.xticks(rotation=45)
brand_plot()
plt.show()

Figure 13: Final Model Execution Time Comparisons

7.4 Execution Time Analysis

In Figure 13 we compare the execution times of all ML models. Here we see that our clustering regression models, OptClusterXGB and OptClusterLinReg, execute much more slowly than the other models. While they are more accurate, this accuracy comes at a cost: one has to weigh the execution time against the accuracy gain, and in certain environments it may be better to sacrifice accuracy for speed.

8 Conclusion

In this project we sought to test and validate methods for using the output of unsupervised ML methods as input to supervised ML methods to perform price prediction on real world real estate listings.

As detailed in Section 3 we started by scraping real estate listings from Zillow and supplementing them with data from Redfin. In total we scraped 718 listings with 161 columns. In Section 4 the data was cleaned to remove unnecessary and duplicate features, and categorical data was transformed using one hot encoding. The data was then split into training and test data sets. Section 5 visualized feature importance and took a deeper look at the most important features.

In Section 6 we began with simple linear regression and progressed to our planned unsupervised-to-supervised modeling approach. We used PCA to reduce the dimensionality of the features, and clustering algorithms to create organized subsets of the data. These subsets were passed to supervised regression algorithms. For a final comparison we added 3 neural network implementations.

Section 7 visualizes and analyzes the model accuracy and execution time results. The best performing models combine clustering and regression, with KMeans and LinearRegression performing the best. While accurate, these clustering and regression models are also slow to execute, especially when compared to basic linear regression.

Through this project we learned:

  • Cleaning techniques for real world real estate listing data
    • One Hot Encoding (OHE) using pandas.get_dummies
  • Integration of unsupervised and supervised methods into combined models that predict prices
    • Combining clusters and regression requires careful tracking of the data
  • Feature selection with SelectKBest provided a relatively large increase in \(R^2\) for a small increase in execution time:
    • May be an ideal solution for environments that can sacrifice some accuracy for speed
  • Principal Component Analysis offers a simple API for reducing the complexity of features
  • Optimization of feature selection becomes increasingly important with the number of inputs:
    • ClusterRegression can iterate over feature columns, cluster columns, and the number of clusters.
      • Speed optimization required using optimized algorithms for feature selection
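
The one hot encoding takeaway above can be sketched with pandas.get_dummies (a minimal example on a made-up frame mirroring this project's heatingTypeCode column; the values are illustrative only):

```python
import pandas as pd

# Hypothetical categorical column similar to those in the listing data
df = pd.DataFrame({"heatingTypeCode": ["Forced Air", "Radiant", "Hot Water"]})

ohe = pd.get_dummies(df, columns=["heatingTypeCode"])
print(list(ohe.columns))
# ['heatingTypeCode_Forced Air', 'heatingTypeCode_Hot Water', 'heatingTypeCode_Radiant']
```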

Areas of Exploration:

  • Tuning hyperparameters of clustering and regression models
  • Testing different regression and clustering models
  • Speed optimization
  • Dataset Tuning
    • Removal of OHE features

Overall, this project provided a solid learning platform to iterate on unsupervised and supervised machine learning implementations. Throughout this project we were forced to confront the realities of using machine learning. Some implementations were slow, some were too complex, and some provided poor accuracy. In the end we were able to follow through on our goal of creating an accurate price prediction algorithm.

9 Python Environment

Below are the versions of python and included libraries used for this project:

Code
import sys
print("Python Version:", sys.version)

# Print module versions: https://stackoverflow.com/a/49199019
import pkg_resources
import types
def get_imports():
    for name, val in globals().items():
        if isinstance(val, types.ModuleType):
            name = val.__name__.split(".")[0]

        elif isinstance(val, type):
            name = val.__module__.split(".")[0]

        poorly_named_packages = {
            "PIL": "Pillow",
            "sklearn": "scikit-learn"
        }
        if name in poorly_named_packages.keys():
            name = poorly_named_packages[name]

        yield name
imports = list(set(get_imports()))

requirements = []
for m in pkg_resources.working_set:
    if m.project_name in imports and m.project_name!="pip":
        requirements.append((m.project_name, m.version))

for r in requirements:
    print("{}=={}".format(*r))
Python Version: 3.11.4 (main, Jul 25 2023, 17:36:13) [Clang 14.0.3 (clang-1403.0.22.14.1)]
matplotlib==3.7.2
numpy==1.24.3
pandas==2.1.1
scikit-learn==1.3.0
seaborn==0.13.0
xgboost==2.0.3

References

Pedregosa, F., G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, et al. 2011. “Scikit-Learn: Machine Learning in Python.” Journal of Machine Learning Research 12: 2825–30.